The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether or not a client will repay a loan. To make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
Kaggle is a data science competition platform that hosts a large number of datasets. In the past, submitting your results was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.:
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
Place your `kaggle.json` API token file in the right place (`~/.kaggle/kaggle.json`). For more detailed information on setting up the Kaggle API, see here and here.
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/site-packages (1.5.12) Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/site-packages (from kaggle) (5.0.2) Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/site-packages (from kaggle) (1.26.6) Requirement already satisfied: requests in /usr/local/lib/python3.7/site-packages (from kaggle) (2.25.1) Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/site-packages (from kaggle) (2.8.2) Requirement already satisfied: certifi in /usr/local/lib/python3.7/site-packages (from kaggle) (2021.5.30) Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/site-packages (from kaggle) (1.15.0) Requirement already satisfied: tqdm in /usr/local/lib/python3.7/site-packages (from kaggle) (4.62.1) Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests->kaggle) (2.10) Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests->kaggle) (4.0.0) WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available. You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
!pwd
/root/shared/Documents/Indiana Coursework/INFO-I 526 Applied ML/AML-project
!mkdir ~/.kaggle
!cp /root/shared/Downloads/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
! kaggle competitions files home-credit-default-risk
name size creationDate ---------------------------------- ----- ------------------- application_test.csv 25MB 2019-12-11 02:55:35 application_train.csv 158MB 2019-12-11 02:55:35 HomeCredit_columns_description.csv 37KB 2019-12-11 02:55:35 bureau_balance.csv 358MB 2019-12-11 02:55:35 installments_payments.csv 690MB 2019-12-11 02:55:35 previous_application.csv 386MB 2019-12-11 02:55:35 POS_CASH_balance.csv 375MB 2019-12-11 02:55:35 bureau.csv 162MB 2019-12-11 02:55:35 sample_submission.csv 524KB 2019-12-11 02:55:35 credit_card_balance.csv 405MB 2019-12-11 02:55:35
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19 May 2018).
There are 7 different sources of data:
- application_{train|test}.csv: the main table with static data for all applications; one row per loan, identified by SK_ID_CURR (TARGET is present only in the train split)
- bureau.csv: clients' previous credits from other financial institutions, as reported to the credit bureau
- bureau_balance.csv: monthly balances of the previous credits in bureau.csv, linked via SK_ID_BUREAU
- POS_CASH_balance.csv: monthly balance snapshots of previous point-of-sale and cash loans with Home Credit
- credit_card_balance.csv: monthly balance snapshots of previous credit cards with Home Credit
- previous_application.csv: all previous applications for Home Credit loans, identified by SK_ID_PREV
- installments_payments.csv: repayment history for previously disbursed Home Credit credits
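The tables link through shared keys: SK_ID_CURR ties the supplementary tables to the main application table, SK_ID_BUREAU links bureau_balance to bureau, and SK_ID_PREV links the monthly-balance tables to previous_application. A minimal sketch of aggregating a child table and joining it back onto the application table, using tiny made-up frames (the column names mirror the real files, but the values are invented):

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (invented values).
app = pd.DataFrame({"SK_ID_CURR": [100002, 100003],
                    "AMT_CREDIT": [406597.5, 1293502.5]})
bureau = pd.DataFrame({"SK_ID_CURR": [100002, 100002, 100003],
                       "SK_ID_BUREAU": [1, 2, 3],
                       "AMT_CREDIT_SUM": [91323.0, 225000.0, 464323.5]})

# Aggregate the child table down to one row per client, then left-join
# so every application row is kept even without bureau history.
agg = (bureau.groupby("SK_ID_CURR")
             .agg(BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
                  BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"))
             .reset_index())
app = app.merge(agg, on="SK_ID_CURR", how="left")
print(app)
```

The aggregate-then-merge pattern keeps the result at one row per SK_ID_CURR, which is what the model ultimately needs.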
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Use the Download button on the competition's Data webpage and unzip the zip file into the `DATA_DIR`:
DATA_DIR = "../Data/home-credit-default-risk" # same level as the course repo, in the data directory
!mkdir $DATA_DIR
mkdir: cannot create directory ‘../Data/home-credit-default-risk’: File exists
!ls -l $DATA_DIR
total 3332204 -rw-rw-r-- 1 root root 37383 Dec 11 2019 HomeCredit_columns_description.csv -rw-rw-r-- 1 root root 392703158 Dec 11 2019 POS_CASH_balance.csv -rw-rw-r-- 1 root root 26567651 Dec 11 2019 application_test.csv -rw-rw-r-- 1 root root 166133370 Dec 11 2019 application_train.csv -rw-rw-r-- 1 root root 170016717 Dec 11 2019 bureau.csv -rw-rw-r-- 1 root root 375592889 Dec 11 2019 bureau_balance.csv -rw-rw-r-- 1 root root 424582605 Dec 11 2019 credit_card_balance.csv -rw-r--r-- 1 root root 721616255 Nov 10 07:15 home-credit-default-risk.zip -rw-rw-r-- 1 root root 723118349 Dec 11 2019 installments_payments.csv -rw-rw-r-- 1 root root 404973293 Dec 11 2019 previous_application.csv -rw-rw-r-- 1 root root 536202 Dec 11 2019 sample_submission.csv
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
home-credit-default-risk.zip: Skipping, found more recently modified local copy (use --force to force download)
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = False
if unzippingReq:  # set to True if the individual .csv.zip archives still need extracting
    archive_names = ('application_train', 'application_test', 'bureau_balance',
                     'bureau', 'credit_card_balance', 'installments_payments',
                     'POS_CASH_balance', 'previous_application')
    for name in archive_names:
        with zipfile.ZipFile(f'{name}.csv.zip', 'r') as zip_ref:
            zip_ref.extractall('datasets')
def load_data(in_path, name):
df = pd.read_csv(in_path)
print(f"{name}: shape is {df.shape}")
print(df.info())
display(df.head(5))
return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
(307511, 122)
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: gender, income, family status, education, and so on.
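The info output above shows application_train mixes numeric columns (int64/float64) with categorical ones (object), which will need different preprocessing later. A quick way to split them with `select_dtypes`, sketched here on a tiny made-up frame rather than the full table:

```python
import pandas as pd

# Tiny stand-in for application_train (invented values).
df = pd.DataFrame({"SK_ID_CURR": [100002, 100003],
                   "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
                   "AMT_INCOME_TOTAL": [202500.0, 270000.0]})

# Partition column names by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)
print(categorical_cols)
```

On the real table the same two lines recover the 106 numeric and 16 object columns reported by `info()`.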
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
"previous_application","POS_CASH_balance")
for ds_name in ds_names:
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB None
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB None
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB None
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB None
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (10001358, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: user 40.8 s, sys: 43.3 s, total: 1min 24s Wall time: 1min 31s
for ds_name in datasets.keys():
print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train : [ 307,511, 122] dataset application_test : [ 48,744, 121] dataset bureau : [ 1,716,428, 17] dataset bureau_balance : [ 27,299,925, 3] dataset credit_card_balance : [ 3,840,312, 23] dataset installments_payments : [ 13,605,401, 8] dataset previous_application : [ 1,670,214, 37] dataset POS_CASH_balance : [ 10,001,358, 8]
print('\033[1m' + "Size of each dataset : " + '\033[0m' , end = '\n' * 2)
for ds_name in datasets.keys():
print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]:4}]')
Size of each dataset :
dataset application_train : [ 307,511, 122]
dataset application_test : [ 48,744, 121]
dataset bureau : [ 1,716,428, 17]
dataset bureau_balance : [ 27,299,925, 3]
dataset credit_card_balance : [ 3,840,312, 23]
dataset installments_payments : [ 13,605,401, 8]
dataset previous_application : [ 1,670,214, 37]
dataset POS_CASH_balance : [ 10,001,358, 8]
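With tables this size (bureau_balance alone has about 27 million rows), memory can become a bottleneck. One common trick, not part of the original notebook, is downcasting numeric dtypes after loading; a sketch on a toy frame whose columns loosely mimic a balance table:

```python
import numpy as np
import pandas as pd

# Toy frame with default 8-byte dtypes (invented values).
df = pd.DataFrame({"MONTHS_BALANCE": np.arange(0, -1000, -1, dtype=np.int64),
                   "AMT_BALANCE": np.arange(1000, dtype=np.float64) * 0.5})

before = df.memory_usage(deep=True).sum()
# Downcast each column to the smallest dtype that holds its values losslessly.
for col in df.select_dtypes(include="integer"):
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include="float"):
    df[col] = pd.to_numeric(df[col], downcast="float")
after = df.memory_usage(deep=True).sum()
print(f"{before} -> {after} bytes")
```

Here the int64 column fits in int16 and the float64 column in float32, roughly halving the footprint; on the real 600+ MB tables the savings are proportionally similar.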
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
datasets["application_train"].describe()  # numeric features only
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
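The TARGET mean of about 0.081 in the summary above means only roughly 8% of clients default, so the classes are heavily imbalanced and plain accuracy would be misleading (the competition is scored on ROC AUC). A quick imbalance check, sketched on a synthetic target series with the same approximate ratio:

```python
import pandas as pd

# Synthetic stand-in for application_train["TARGET"]: ~8% positives.
target = pd.Series([1] * 8 + [0] * 92, name="TARGET")

# Normalized value counts give the class proportions directly.
print(target.value_counts(normalize=True))
```

On the real data, `datasets["application_train"]["TARGET"].value_counts(normalize=True)` gives the same kind of breakdown.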
datasets["application_test"].describe()  # numeric features only
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
datasets["application_train"].describe(include='all') #look at all categorical and numerical
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
11 rows × 122 columns
from IPython.display import display, HTML
pd.set_option("display.max_rows", None, "display.max_columns", None)
# Full stats
def stats_summary1(df, df_name):
    df.info(verbose=True, null_counts=True)  # `null_counts` was renamed `show_counts` in newer pandas
    print("-----"*15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----"*15)
    print(f"Statistical summary of {df_name}:")
    print("-----"*15)
    print(f"Description of the df {df_name}:\n")
    # display() renders and returns None, so don't wrap it in print();
    # use the df argument rather than hard-coding 'application_train'.
    display(HTML(np.round(df.describe(), 2).to_html()))
def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----"*15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nNumber of unique values per object (categorical) column: \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))
# List the categorical and numerical features of a DF.
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----"*15)
    print(f"Categorical and Numerical(int + float) features of {df_name}.")
    print("-----"*15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---"*10)
    print("\n \n")
# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----"*15)
    print("-----"*15)
    print('\n The Missing Data: \n')
    # display(missing_data)  # display only a few rows
    if len(missing_data) == 0:
        print("No missing data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
    print("-----"*15)
    if len(df.columns) > 35:
        f, ax = plt.subplots(figsize=(8, 15))
    else:
        f, ax = plt.subplots()
    plt.title(f'Percent missing data for {df_name}.', fontsize=10)
    # Keyword arguments are required by seaborn >= 0.12.
    sns.barplot(x=missing_data["Percent"], y=missing_data.index, alpha=0.8)
    plt.xlabel('Percent of missing values', fontsize=10)
    plt.ylabel('Features', fontsize=10)
    plt.show()
    return missing_data
# Consolidation of all the stats functions.
def display_stats(df, df_name):
    print("--"*40)
    print(" "*20 + '\033[1m' + df_name + '\033[0m' + " "*20)
    print("--"*40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)
display_feature_info(datasets['application_train'], 'application_train')
Description of the df continued for application_train:
---------------------------------------------------------------------------
Data type value counts:
float64 65
int64 41
object 16
dtype: int64
Number of unique values per object (categorical) column:
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features of application_train.
---------------------------------------------------------------------------
{'int64': Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21'],
dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')}
------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
The Missing Data:
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
| BASEMENTAREA_MEDI | 58.52 | 179943 |
| BASEMENTAREA_AVG | 58.52 | 179943 |
| BASEMENTAREA_MODE | 58.52 | 179943 |
| EXT_SOURCE_1 | 56.38 | 173378 |
| NONLIVINGAREA_MODE | 55.18 | 169682 |
| NONLIVINGAREA_AVG | 55.18 | 169682 |
| NONLIVINGAREA_MEDI | 55.18 | 169682 |
| ELEVATORS_MEDI | 53.30 | 163891 |
| ELEVATORS_AVG | 53.30 | 163891 |
| ELEVATORS_MODE | 53.30 | 163891 |
| WALLSMATERIAL_MODE | 50.84 | 156341 |
| APARTMENTS_MEDI | 50.75 | 156061 |
| APARTMENTS_AVG | 50.75 | 156061 |
| APARTMENTS_MODE | 50.75 | 156061 |
| ENTRANCES_MEDI | 50.35 | 154828 |
| ENTRANCES_AVG | 50.35 | 154828 |
| ENTRANCES_MODE | 50.35 | 154828 |
| LIVINGAREA_AVG | 50.19 | 154350 |
| LIVINGAREA_MODE | 50.19 | 154350 |
| LIVINGAREA_MEDI | 50.19 | 154350 |
| HOUSETYPE_MODE | 50.18 | 154297 |
| FLOORSMAX_MODE | 49.76 | 153020 |
| FLOORSMAX_MEDI | 49.76 | 153020 |
| FLOORSMAX_AVG | 49.76 | 153020 |
| YEARS_BEGINEXPLUATATION_MODE | 48.78 | 150007 |
| YEARS_BEGINEXPLUATATION_MEDI | 48.78 | 150007 |
| YEARS_BEGINEXPLUATATION_AVG | 48.78 | 150007 |
| TOTALAREA_MODE | 48.27 | 148431 |
| EMERGENCYSTATE_MODE | 47.40 | 145755 |
| OCCUPATION_TYPE | 31.35 | 96391 |
| EXT_SOURCE_3 | 19.83 | 60965 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_DAY | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_MON | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_QRT | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 13.50 | 41519 |
| NAME_TYPE_SUITE | 0.42 | 1292 |
| OBS_30_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| DEF_30_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0.33 | 1021 |
| EXT_SOURCE_2 | 0.21 | 660 |
| AMT_GOODS_PRICE | 0.09 | 278 |
---------------------------------------------------------------------------
The descriptive statistics show negative values for DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, and DAYS_ID_PUBLISH. These columns are recorded as days relative to the application date, so negative values are expected; the real anomaly is the DAYS_EMPLOYED maximum of 365243 (roughly 1000 years), which is evidently a placeholder value.
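To make these columns easier to interpret, here is a minimal sketch (on a toy frame; in the notebook the argument would be `application_train`, and `add_day_features` is a hypothetical helper) that converts the DAYS_* columns to positive years and flags the 365243 sentinel, mirroring the cleaning done later in this notebook:

```python
import numpy as np
import pandas as pd

def add_day_features(df):
    """Convert relative DAYS_* columns to positive years and flag the
    365243 sentinel that marks anomalous DAYS_EMPLOYED values."""
    out = df.copy()
    out["AGE_YEARS"] = out["DAYS_BIRTH"] / -365
    out["DAYS_EMPLOYED_ANOM"] = out["DAYS_EMPLOYED"] == 365243
    out["DAYS_EMPLOYED"] = out["DAYS_EMPLOYED"].replace({365243: np.nan})
    out["YEARS_EMPLOYED"] = out["DAYS_EMPLOYED"] / -365
    return out

# Toy rows: one normal client, one carrying the sentinel.
toy = pd.DataFrame({"DAYS_BIRTH": [-12005, -16000],
                    "DAYS_EMPLOYED": [-1200, 365243]})
res = add_day_features(toy)
```

The flag column preserves the information that the value was anomalous, while the NaN keeps the sentinel from distorting statistics and plots.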
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_train_data.head(20)
| Percent | Test Missing Count | |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
Explore the distribution of values taken on by the target variable.
datasets["application_train"].groupby(['TARGET'])['SK_ID_CURR'].count()
TARGET
0    282686
1     24825
Name: SK_ID_CURR, dtype: int64
datasets["application_train"]['TARGET'].plot.hist()
plt.show()
df_app_train=datasets["application_train"].copy()
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
df_app_train['DAYS_EMPLOYED'].replace({365243:np.nan}, inplace=True)
plt.hist(df_app_train['DAYS_EMPLOYED'],edgecolor = 'k', bins = 25)
plt.title('DAYS_EMPLOYED'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count');
plt.show()
The number of days employed is an important feature for predicting risk. However, the raw column contained the implausible placeholder 365243 (about 1000 years of employment), which we flag and replace with NaN above before plotting.
plt.hist(datasets["application_train"]['OWN_CAR_AGE'], edgecolor='k', bins=25)
plt.title('OWN CAR AGE'); plt.xlabel('Car age (years)'); plt.ylabel('Count');
plt.show()
The histogram shows a number of applicants reporting cars more than 60 years old; these extreme values are likely data-entry errors or placeholders.
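If these extreme ages are treated as recording errors, one option — a sketch, not necessarily the right modeling choice — is to winsorize the upper tail at a quantile (`cap_at_quantile` is a hypothetical helper, not part of this notebook's pipeline):

```python
import pandas as pd

def cap_at_quantile(s, q=0.99):
    """Winsorize the upper tail: values above the q-quantile are clipped."""
    upper = s.quantile(q)
    return s.clip(upper=upper)

# Toy series with one implausibly large value (q=0.75 only to make the
# clipping visible on four points; for real data q=0.99 is more typical).
toy = pd.Series([1.0, 5.0, 10.0, 91.0])
capped = cap_at_quantile(toy, q=0.75)
```

Capping keeps the observation in the dataset while limiting the leverage the outlier has on scaling and linear models.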
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
pos = datasets["application_train"][['TARGET','DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_BIRTH']]
pos_corr = pos.corr()
sns.heatmap(pos_corr, annot = True, cmap='viridis')
plt.show()
Most Positive Correlations:
FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64
print('\nMost Negative Correlations:\n', correlations.head(10))
neg = datasets["application_train"][['TARGET','EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_EMPLOYED']]
neg_corr = neg.corr()
sns.heatmap(neg_corr, annot = True, cmap='viridis')
plt.show()
Most Negative Correlations:
EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64
The distributions of the top correlated features are plotted below.
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(15, 20))
for i, var in enumerate(var_neg_corr):
    plt.subplot(numVar, 4, i + 1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize=10)
plt.tight_layout()
plt.show()
Density plots of the correlated features, separated by target class, are plotted below.
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(10, 40))
for i, var in enumerate(var_neg_corr):
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 1, var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 0, var]
    plt.subplot(numVar, 3, i + 1)
    plt.subplots_adjust(wspace=2)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    plt.ylabel('Density')
    plt.title(var, fontsize=10)
plt.tight_layout()
plt.show()
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
plt.show()
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
plt.show()
sns.histplot(x='AMT_CREDIT', data=datasets["application_train"], bins=50);
plt.title('Distribution of AMT_CREDIT');
plt.show()
# Credit amounts are right-skewed and outliers exist
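Given that right skew, a common remedy worth considering (a hedged suggestion, not applied elsewhere in this notebook) is a log1p transform of the credit amount before modeling, which compresses the long right tail:

```python
import numpy as np
import pandas as pd

def log1p_feature(s):
    """log(1 + x) compresses the right tail and is safe for zero values."""
    return np.log1p(s)

# Toy credit amounts spanning several orders of magnitude.
toy = pd.Series([0.0, 45000.0, 4050000.0])
logged = log1p_feature(toy)
```

After the transform, the spread between a typical loan and the largest loans is a few log units rather than a factor of nearly a hundred.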
fig, ax = plt.subplots(figsize=(10, 10))
f = sns.scatterplot(data = datasets["application_train"], x = 'AMT_INCOME_TOTAL', y = 'AMT_CREDIT', hue = 'TARGET')
f.set(xlim=(0, 1000000))
plt.show()
plt.subplots(figsize=(15, 15))
d = sns.boxplot(x = datasets["application_train"]['NAME_EDUCATION_TYPE'],
y = datasets["application_train"]['AMT_CREDIT'],
hue = datasets["application_train"]['NAME_FAMILY_STATUS'], palette="Set3")
d.set(ylim=(0, 2000000))
plt.xticks(rotation=90)
plt.show()
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(48744, 121)
datasets["application_train"].shape
(307511, 122)
Most of the clients in the Kaggle submission file appear in previous_application.csv: 47,800 out of 48,744 test clients have had at least one previous application.
appsDF = datasets["previous_application"]
appsDF.shape
(1670214, 37)
len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"]))
47800
print(f"There are {appsDF.shape[0]:,} previous applications")
There are 1,670,214 previous applications
# How many previous applications are there per client?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
#prevAppCounts
len(prevAppCounts[prevAppCounts > 40])  # more than 40 previous applications
101
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
sum(appsDF['SK_ID_CURR'].value_counts()==1)
60458
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
plt.show()
* Low = fewer than 5 previous applications (~58% of clients)
* Medium = 5 to 39 previous applications (~42%)
* High = 40 or more previous applications (~0.03%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 5 or more previous apps: 41.76895
Percentage with 40 or more previous apps: 0.03453
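The low/medium/high grouping above can be made explicit with `pd.cut` over the per-client counts; the thresholds here follow the bucket definitions, but the toy counts are illustrative:

```python
import pandas as pd

# Toy per-client previous-application counts.
counts = pd.Series([1, 3, 7, 12, 45])

# Right-closed bins: (0, 4] -> low, (4, 39] -> medium, (39, inf) -> high.
buckets = pd.cut(counts,
                 bins=[0, 4, 39, float("inf")],
                 labels=["low", "medium", "high"])
```

The resulting categorical column can then be one-hot encoded or used directly as an ordinal feature.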
In the case of the HCDR competition (and many other machine learning problems involving multiple tables, whether in 3NF or not), we need to join (denormalize) these datasets for use in a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these tend to be aggregate features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
Joining previous_application with application_x

We refer to the application_train data (and likewise the application_test data) as the primary table and the other files (e.g., the previous_application dataset) as secondary tables. The secondary tables can be joined to the primary table using the key SK_ID_CURR (SK_ID_PREV is the primary key of previous_application itself and identifies individual prior applications).
Let's assume we wish to generate features based on previous application attempts. Possible features include aggregates of AMT_APPLICATION and AMT_CREDIT (average, min, max, median, etc.). To build such features, we need to join the application_train data (and likewise the application_test data) with the previous_application dataset (and the other available datasets).
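As a minimal sketch of this aggregate-then-join step (toy frames stand in for previous_application and application_train; the aggregation choices and the PREV_ column prefix are illustrative), previous_application can be collapsed to one row per SK_ID_CURR and left-joined onto the application table:

```python
import pandas as pd

# Toy stand-ins for previous_application and application_train.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [100.0, 300.0, 50.0],
})
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Collapse to one row per client: min/max/mean of the previous amounts.
agg = prev.groupby("SK_ID_CURR")["AMT_APPLICATION"].agg(["min", "max", "mean"])
agg.columns = [f"PREV_AMT_APPLICATION_{c}" for c in agg.columns]
agg = agg.reset_index()

# Left join keeps clients with no previous applications (NaN aggregates).
joined = apps.merge(agg, on="SK_ID_CURR", how="left")
```

The left join is important: clients without any previous application stay in the table with NaN aggregates, which can later be imputed or flagged rather than silently dropped.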
When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:
1. Join the secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to partitioning the data (into train, valid, test) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?] I want you to think about this section and build on this.
2. Join the secondary tables after partitioning the application_train data (the labeled dataset) and the application_test data (the unlabeled submission dataset), thereby leading to X_train, y_train, X_valid, etc.

appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Canceled | -14 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Canceled | -14 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
appsDF.isna().sum()
SK_ID_PREV                           0
SK_ID_CURR                           0
NAME_CONTRACT_TYPE                   0
AMT_ANNUITY                     372235
AMT_APPLICATION                      0
AMT_CREDIT                           1
AMT_DOWN_PAYMENT                895844
AMT_GOODS_PRICE                 385515
WEEKDAY_APPR_PROCESS_START           0
HOUR_APPR_PROCESS_START              0
FLAG_LAST_APPL_PER_CONTRACT          0
NFLAG_LAST_APPL_IN_DAY               0
RATE_DOWN_PAYMENT               895844
RATE_INTEREST_PRIMARY          1664263
RATE_INTEREST_PRIVILEGED       1664263
NAME_CASH_LOAN_PURPOSE               0
NAME_CONTRACT_STATUS                 0
DAYS_DECISION                        0
NAME_PAYMENT_TYPE                    0
CODE_REJECT_REASON                   0
NAME_TYPE_SUITE                 820405
NAME_CLIENT_TYPE                     0
NAME_GOODS_CATEGORY                  0
NAME_PORTFOLIO                       0
NAME_PRODUCT_TYPE                    0
CHANNEL_TYPE                         0
SELLERPLACE_AREA                     0
NAME_SELLER_INDUSTRY                 0
CNT_PAYMENT                     372230
NAME_YIELD_GROUP                     0
PRODUCT_COMBINATION                346
DAYS_FIRST_DRAWING              673065
DAYS_FIRST_DUE                  673065
DAYS_LAST_DUE_1ST_VERSION       673065
DAYS_LAST_DUE                   673065
DAYS_TERMINATION                673065
NFLAG_INSURED_ON_APPROVAL       673065
dtype: int64
appsDF.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# print(f"{appsDF[features].describe()}")
# agg_ops = ["min", "max", "mean"]
# result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg("mean") #group by ID
# display(result.head())
# print("-"*50)
# result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : agg_ops, 'AMT_APPLICATION' : agg_ops})
# result.columns = result.columns.map('_'.join)
# display(result)
# result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
# print(f"result.shape: {result.shape}")
# result[0:10]
        AMT_ANNUITY  AMT_APPLICATION
count  1.297979e+06     1.670214e+06
mean   1.595512e+04     1.752339e+05
std    1.478214e+04     2.927798e+05
min    0.000000e+00     0.000000e+00
25%    6.321780e+03     1.872000e+04
50%    1.125000e+04     7.104600e+04
75%    2.065842e+04     1.803600e+05
max    4.180581e+05     6.905160e+06
| SK_ID_CURR | SK_ID_PREV | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 1.369693e+06 | 3951.000 | 24835.50 | 23787.00 | 2520.0 | 24835.5 | 13.000000 | 1.0 | 0.104326 | NaN | NaN | -1740.0 | 23.0 | 8.0 | 365243.0 | -1709.000000 | -1499.000000 | -1619.000000 | -1612.000000 | 0.000000 |
| 1 | 100002 | 1.038818e+06 | 9251.775 | 179055.00 | 179055.00 | 0.0 | 179055.0 | 9.000000 | 1.0 | 0.000000 | NaN | NaN | -606.0 | 500.0 | 24.0 | 365243.0 | -565.000000 | 125.000000 | -25.000000 | -17.000000 | 0.000000 |
| 2 | 100003 | 2.281150e+06 | 56553.990 | 435436.50 | 484191.00 | 3442.5 | 435436.5 | 14.666667 | 1.0 | 0.050030 | NaN | NaN | -1305.0 | 533.0 | 10.0 | 365243.0 | -1274.333333 | -1004.333333 | -1054.333333 | -1047.333333 | 0.666667 |
| 3 | 100004 | 1.564014e+06 | 5357.250 | 24282.00 | 20106.00 | 4860.0 | 24282.0 | 5.000000 | 1.0 | 0.212008 | NaN | NaN | -815.0 | 30.0 | 4.0 | 365243.0 | -784.000000 | -694.000000 | -724.000000 | -714.000000 | 0.000000 |
| 4 | 100005 | 2.176837e+06 | 4813.200 | 22308.75 | 20076.75 | 4464.0 | 44617.5 | 10.500000 | 1.0 | 0.108964 | NaN | NaN | -536.0 | 18.0 | 12.0 | 365243.0 | -706.000000 | -376.000000 | -466.000000 | -460.000000 | 0.000000 |
--------------------------------------------------
# result.isna().sum()
# # Create aggregate features (via pipeline)
# class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
# def __init__(self, features=None): # no *args or **kargs
# self.features = features
# self.agg_op_features = ["min", "max", "mean"]
# def fit(self, X, y=None):
# return self
# def transform(self, X, y=None):
# #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
# result = X.groupby(["SK_ID_CURR"], as_index=False).agg("mean") #group by ID
# result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : self.agg_op_features, 'AMT_APPLICATION' : self.agg_op_features})
# result.columns = result.columns.map('_'.join)
# #display(result)
# #result = result.reset_index(level=["SK_ID_CURR"])
# result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
# return result # return dataframe with the join key "SK_ID_CURR"
# # result[0:10]
# from sklearn.pipeline import make_pipeline
# def test_driver_prevAppsFeaturesAggregater(df, features):
# print(f"df.shape: {df.shape}\n")
# print(f"df[{features}][0:5]: \n{df[features][0:5]}")
# test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
# return(test_pipeline.fit_transform(df))
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# features = ['AMT_ANNUITY',
# 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
# 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
# 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
# 'CNT_PAYMENT',
# 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
# 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
# print(f"HELLO")
# print(f"Test driver: \n{res[0:10]}")
# print(f"input[features][0:10]: \n{appsDF[0:10]}")
# # QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
df.shape: (1670214, 37)
df[['AMT_ANNUITY', 'AMT_APPLICATION']][0:5]:
AMT_ANNUITY AMT_APPLICATION
0 1730.430 17145.0
1 25188.615 607500.0
2 15060.735 112500.0
3 47041.335 450000.0
4 31924.395 337500.0
HELLO
Test driver:
SK_ID_CURR_ AMT_ANNUITY_min AMT_ANNUITY_max AMT_ANNUITY_mean \
0 100001 3951.000 3951.000 3951.000000
1 100002 9251.775 9251.775 9251.775000
2 100003 6737.310 98356.995 56553.990000
3 100004 5357.250 5357.250 5357.250000
4 100005 4813.200 4813.200 4813.200000
5 100006 2482.920 39954.510 23651.175000
6 100007 1834.290 22678.785 12278.805000
7 100008 8019.090 25309.575 15839.696250
8 100009 7435.845 17341.605 10051.412143
9 100010 27463.410 27463.410 27463.410000
AMT_APPLICATION_min AMT_APPLICATION_max AMT_APPLICATION_mean \
0 24835.5 24835.5 24835.500000
1 179055.0 179055.0 179055.000000
2 68809.5 900000.0 435436.500000
3 24282.0 24282.0 24282.000000
4 0.0 44617.5 22308.750000
5 0.0 688500.0 272203.260000
6 17176.5 247500.0 150530.250000
7 0.0 450000.0 155701.800000
8 40455.0 110160.0 76741.714286
9 247212.0 247212.0 247212.000000
range_AMT_APPLICATION
0 0.0
1 0.0
2 831190.5
3 0.0
4 44617.5
5 688500.0
6 230323.5
7 450000.0
8 69705.0
9 0.0
input[features][0:10]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION \
0 2030495 271877 Consumer loans 1730.430 17145.0
1 2802425 108129 Cash loans 25188.615 607500.0
2 2523466 122040 Cash loans 15060.735 112500.0
3 2819243 176158 Cash loans 47041.335 450000.0
4 1784265 202054 Cash loans 31924.395 337500.0
5 1383531 199383 Cash loans 23703.930 315000.0
6 2315218 175704 Cash loans NaN 0.0
7 1656711 296299 Cash loans NaN 0.0
8 2367563 342292 Cash loans NaN 0.0
9 2579447 334349 Cash loans NaN 0.0
AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START \
0 17145.0 0.0 17145.0 SATURDAY
1 679671.0 NaN 607500.0 THURSDAY
2 136444.5 NaN 112500.0 TUESDAY
3 470790.0 NaN 450000.0 MONDAY
4 404055.0 NaN 337500.0 THURSDAY
5 340573.5 NaN 315000.0 SATURDAY
6 0.0 NaN NaN TUESDAY
7 0.0 NaN NaN MONDAY
8 0.0 NaN NaN MONDAY
9 0.0 NaN NaN SATURDAY
HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT \
0 15 Y
1 11 Y
2 11 Y
3 7 Y
4 9 Y
5 8 Y
6 11 Y
7 7 Y
8 15 Y
9 15 Y
NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY \
0 1 0.0 0.182832
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN NaN
5 1 NaN NaN
6 1 NaN NaN
7 1 NaN NaN
8 1 NaN NaN
9 1 NaN NaN
RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS \
0 0.867336 XAP Approved
1 NaN XNA Approved
2 NaN XNA Approved
3 NaN XNA Approved
4 NaN Repairs Refused
5 NaN Everyday expenses Approved
6 NaN XNA Canceled
7 NaN XNA Canceled
8 NaN XNA Canceled
9 NaN XNA Canceled
DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE \
0 -73 Cash through the bank XAP NaN
1 -164 XNA XAP Unaccompanied
2 -301 Cash through the bank XAP Spouse, partner
3 -512 Cash through the bank XAP NaN
4 -781 Cash through the bank HC NaN
5 -684 Cash through the bank XAP Family
6 -14 XNA XAP NaN
7 -21 XNA XAP NaN
8 -386 XNA XAP NaN
9 -57 XNA XAP NaN
NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE \
0 Repeater Mobile POS XNA
1 Repeater XNA Cash x-sell
2 Repeater XNA Cash x-sell
3 Repeater XNA Cash x-sell
4 Repeater XNA Cash walk-in
5 Repeater XNA Cash x-sell
6 Repeater XNA XNA XNA
7 Repeater XNA XNA XNA
8 Repeater XNA XNA XNA
9 Repeater XNA XNA XNA
CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY \
0 Country-wide 35 Connectivity
1 Contact center -1 XNA
2 Credit and cash offices -1 XNA
3 Credit and cash offices -1 XNA
4 Credit and cash offices -1 XNA
5 Credit and cash offices -1 XNA
6 Credit and cash offices -1 XNA
7 Credit and cash offices -1 XNA
8 Credit and cash offices -1 XNA
9 Credit and cash offices -1 XNA
CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING \
0 12.0 middle POS mobile with interest 365243.0
1 36.0 low_action Cash X-Sell: low 365243.0
2 12.0 high Cash X-Sell: high 365243.0
3 12.0 middle Cash X-Sell: middle 365243.0
4 24.0 high Cash Street: high NaN
5 18.0 low_normal Cash X-Sell: low 365243.0
6 NaN XNA Cash NaN
7 NaN XNA Cash NaN
8 NaN XNA Cash NaN
9 NaN XNA Cash NaN
DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION \
0 -42.0 300.0 -42.0 -37.0
1 -134.0 916.0 365243.0 365243.0
2 -271.0 59.0 365243.0 365243.0
3 -482.0 -152.0 -182.0 -177.0
4 NaN NaN NaN NaN
5 -654.0 -144.0 -144.0 -137.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
NFLAG_INSURED_ON_APPROVAL
0 0.0
1 1.0
2 1.0
3 1.0
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
9 NaN
Merge each secondary dataset with the primary dataset's (application_train) target variable to understand the correlation between the target variable and the secondary dataset's features.
def correlation_with_target(df):
app_train = datasets["application_train"].copy()
second_df = datasets[df].copy()
corr_matrix = pd.concat([app_train.TARGET, second_df], axis=1).corr().filter(second_df.columns).filter(app_train.columns, axis=0)
return corr_matrix
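Note that `pd.concat(..., axis=1)` aligns rows by index, not by `SK_ID_CURR`, which is one reason the correlations below are all near zero. A minimal sketch of an ID-aware alternative, using made-up toy values (the column names mirror the real tables, but the data is illustrative): aggregate the secondary table to one row per client, then merge on the key before correlating.

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (values are made up).
app_train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "TARGET": [0, 1, 0, 1],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3, 4],
    "AMT_CREDIT_SUM": [100.0, 200.0, 50.0, 300.0, 400.0],
})

# One row per client, then a key-based merge so each TARGET is paired
# with its own client's statistics rather than a positional neighbor.
per_client = bureau.groupby("SK_ID_CURR", as_index=False).mean()
joined = app_train.merge(per_client, on="SK_ID_CURR", how="left")
corr = joined.drop(columns="SK_ID_CURR").corr()["TARGET"].drop("TARGET")
print(corr)
```

The merge keeps one row per application, so the correlation is computed over properly matched client pairs.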
The following secondary datasets will be explored for correlation against the target variable.
for dataset in datasets.keys():
print(dataset)
application_train
application_test
bureau
bureau_balance
credit_card_balance
installments_payments
previous_application
POS_CASH_balance
df_name = "bureau"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is:")
correlation_matrix.T.TARGET.sort_values(ascending=False)
Correlation of the bureau against the Target is:
DAYS_CREDIT_UPDATE        0.002159
DAYS_CREDIT_ENDDATE       0.002048
SK_ID_BUREAU              0.001550
DAYS_CREDIT               0.001443
AMT_CREDIT_SUM            0.000218
DAYS_ENDDATE_FACT         0.000203
AMT_ANNUITY               0.000189
AMT_CREDIT_MAX_OVERDUE   -0.000389
CNT_CREDIT_PROLONG       -0.000495
AMT_CREDIT_SUM_LIMIT     -0.000558
AMT_CREDIT_SUM_DEBT      -0.000946
SK_ID_CURR               -0.001070
AMT_CREDIT_SUM_OVERDUE   -0.001464
CREDIT_DAY_OVERDUE       -0.001815
Name: TARGET, dtype: float64
df_name = "bureau_balance"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is:")
correlation_matrix.T.TARGET.sort_values(ascending=False)
Correlation of the bureau_balance against the Target is:
SK_ID_BUREAU      0.001223
MONTHS_BALANCE   -0.005262
Name: TARGET, dtype: float64
df_name = "credit_card_balance"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is:")
correlation_matrix.T.TARGET.sort_values(ascending=False)
Correlation of the credit_card_balance against the Target is:
CNT_DRAWINGS_ATM_CURRENT      0.001908
AMT_DRAWINGS_ATM_CURRENT      0.001520
AMT_INST_MIN_REGULARITY       0.001435
SK_ID_CURR                    0.001086
AMT_CREDIT_LIMIT_ACTUAL       0.000515
AMT_BALANCE                   0.000448
SK_ID_PREV                    0.000446
AMT_RECIVABLE                 0.000412
AMT_TOTAL_RECEIVABLE          0.000407
AMT_RECEIVABLE_PRINCIPAL      0.000383
SK_DPD                        0.000092
SK_DPD_DEF                   -0.000201
CNT_INSTALMENT_MATURE_CUM    -0.000342
MONTHS_BALANCE               -0.000768
AMT_PAYMENT_CURRENT          -0.001129
AMT_PAYMENT_TOTAL_CURRENT    -0.001395
AMT_DRAWINGS_CURRENT         -0.001419
CNT_DRAWINGS_CURRENT         -0.001764
CNT_DRAWINGS_OTHER_CURRENT   -0.001833
CNT_DRAWINGS_POS_CURRENT     -0.002387
AMT_DRAWINGS_OTHER_CURRENT   -0.002672
AMT_DRAWINGS_POS_CURRENT     -0.003518
Name: TARGET, dtype: float64
df_name = "installments_payments"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is:")
correlation_matrix.T.TARGET.sort_values(ascending=False)
Correlation of the installments_payments against the Target is:
SK_ID_PREV                0.002891
NUM_INSTALMENT_VERSION    0.002511
NUM_INSTALMENT_NUMBER     0.000626
SK_ID_CURR               -0.000781
AMT_PAYMENT              -0.003512
DAYS_INSTALMENT          -0.003955
AMT_INSTALMENT           -0.003972
DAYS_ENTRY_PAYMENT       -0.004046
Name: TARGET, dtype: float64
df_name = "previous_application"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is:")
correlation_matrix.T.TARGET.sort_values(ascending=False)
Correlation of the previous_application against the Target is:
AMT_DOWN_PAYMENT             0.002496
CNT_PAYMENT                  0.002341
DAYS_LAST_DUE_1ST_VERSION    0.001908
AMT_CREDIT                   0.001833
AMT_APPLICATION              0.001689
AMT_GOODS_PRICE              0.001676
SK_ID_CURR                   0.001107
NFLAG_INSURED_ON_APPROVAL    0.000879
RATE_DOWN_PAYMENT            0.000850
RATE_INTEREST_PRIMARY        0.000542
SK_ID_PREV                   0.000362
DAYS_DECISION               -0.000482
AMT_ANNUITY                 -0.000492
DAYS_FIRST_DUE              -0.000943
SELLERPLACE_AREA            -0.000954
DAYS_TERMINATION            -0.001072
NFLAG_LAST_APPL_IN_DAY      -0.001256
DAYS_FIRST_DRAWING          -0.001293
DAYS_LAST_DUE               -0.001940
HOUR_APPR_PROCESS_START     -0.002285
RATE_INTEREST_PRIVILEGED    -0.026427
Name: TARGET, dtype: float64
df_name = "POS_CASH_balance"
correlation_matrix = correlation_with_target(df_name)
print(f"Correlation of the {df_name} against the Target is:")
correlation_matrix.T.TARGET.sort_values(ascending=False)
Correlation of the POS_CASH_balance against the Target is:
CNT_INSTALMENT_FUTURE    0.002811
MONTHS_BALANCE           0.002775
SK_ID_PREV               0.002164
CNT_INSTALMENT           0.001434
SK_DPD                   0.000050
SK_ID_CURR              -0.000136
SK_DPD_DEF              -0.001362
Name: TARGET, dtype: float64
# Pipelines
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
class FeaturesAggregator(BaseEstimator, TransformerMixin):
def __init__(self, file_name=None, features=None, funcs=None): # no *args or **kwargs
self.file_name = file_name
self.features = features
self.funcs = funcs
self.agg_op_features = {}
for f in self.features:
temp = {f"{file_name}_{f}_{func}":func for func in self.funcs}
self.agg_op_features[f]=[(k, v) for k, v in temp.items()]
print(self.agg_op_features)
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
#from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
result.columns = result.columns.droplevel()
result = result.reset_index(level=["SK_ID_CURR"])
return result # return dataframe with the join key "SK_ID_CURR"
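One caveat: the dict-of-`(name, func)`-tuples passed to `agg` above was deprecated in pandas 0.20 and later removed, so on recent pandas the same flat, prefixed min/max/mean columns are produced with named aggregation. A minimal sketch on toy data (the `aggregate_features` helper and the toy values are illustrative, mirroring `FeaturesAggregator`'s naming convention):

```python
import pandas as pd

def aggregate_features(df, prefix, features, funcs):
    """Per-client aggregates with flat, prefixed column names via named aggregation."""
    named = {
        f"{prefix}_{col}_{func}": pd.NamedAgg(column=col, aggfunc=func)
        for col in features
        for func in funcs
    }
    return df.groupby("SK_ID_CURR", as_index=False).agg(**named)

# Toy previous-application rows: two applications for one client, one for another.
toy = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002],
    "AMT_ANNUITY": [1000.0, 3000.0, 500.0],
})
agg = aggregate_features(toy, "prevApps", ["AMT_ANNUITY"], ["min", "max", "mean"])
print(agg.columns.tolist())
```

Named aggregation yields a flat column index directly, so no `droplevel()` call is needed afterwards.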
agg_funcs = ['min', 'max', 'mean']
# prevApps = datasets['previous_application']
prevApps_features = ['AMT_ANNUITY', 'AMT_APPLICATION','AMT_DOWN_PAYMENT','CNT_PAYMENT','RATE_INTEREST_PRIVILEGED']
# bureau = datasets['bureau']
bureau_features = ['AMT_ANNUITY', 'AMT_CREDIT_SUM','DAYS_CREDIT','AMT_CREDIT_SUM_OVERDUE','CREDIT_DAY_OVERDUE']
# bureau_funcs = ['min', 'max', 'mean', 'count', 'sum']
# bureau_bal = datasets['bureau_balance']
bureau_bal_features = ['MONTHS_BALANCE']
# cc_bal = datasets['credit_card_balance']
cc_bal_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM','AMT_DRAWINGS_ATM_CURRENT'
,'AMT_INST_MIN_REGULARITY','AMT_PAYMENT_TOTAL_CURRENT']
# installments_pmnts = datasets['installments_payments']
installments_pmnts_features = ['AMT_INSTALMENT', 'AMT_PAYMENT']
pos_cash_balance_features = ['CNT_INSTALMENT_FUTURE','MONTHS_BALANCE','SK_DPD_DEF']
Engineer new features capturing the relationship between income and credit amount, and between annuity and income, for the Application dataset.
class engineer_features(BaseEstimator, TransformerMixin):
def __init__(self, features=None):
self.features = features
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
# FROM APPLICATION
# Income Credit Percentage
X['ef_INCOME_CREDIT_PERCENT'] = (
X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
# Annuity as Percentage of Annual Income
X['ef_ANN_INCOME_PERCENT'] = (
X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
return X
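On toy data (the values below are made up), the two engineered ratios behave as follows; `.replace(np.inf, 0)` guards against a zero credit amount producing an infinite ratio:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "AMT_INCOME_TOTAL": [120000.0, 90000.0],
    "AMT_CREDIT": [600000.0, 0.0],   # second row forces a division by zero
    "AMT_ANNUITY": [24000.0, 9000.0],
})
# Same two ratios as engineer_features.transform
X["ef_INCOME_CREDIT_PERCENT"] = (X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
X["ef_ANN_INCOME_PERCENT"] = (X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
print(X[["ef_INCOME_CREDIT_PERCENT", "ef_ANN_INCOME_PERCENT"]])
```

The first row gives 0.2 for both ratios; the second row's income/credit ratio would be `inf` and is zeroed out instead.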
Engineer new features capturing the range of annuity, application, and down payment amounts from the Previous Application dataset.
class prevApp_engineer_features(BaseEstimator, TransformerMixin):
def __init__(self, features=None):
self.features = features
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
# FROM PREVIOUS APPLICATION
# Add annuity, application, and down payment ranges
X['ef_prevApps_AMT_ANNUITY_range'] = (X.prevApps_AMT_ANNUITY_max - X.prevApps_AMT_ANNUITY_min).replace(np.inf, 0)
X['ef_prevApps_AMT_APPLICATION_range'] = (X.prevApps_AMT_APPLICATION_max - X.prevApps_AMT_APPLICATION_min).replace(np.inf, 0)
X['ef_prevApps_AMT_DOWN_PAYMENT_range'] = (X.prevApps_AMT_DOWN_PAYMENT_max - X.prevApps_AMT_DOWN_PAYMENT_min).replace(np.inf, 0)
return X
from sklearn.pipeline import make_pipeline, Pipeline
prevApps_features_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
('prevApps_aggregator', FeaturesAggregator('prevApps', prevApps_features, agg_funcs)), # Aggregate across old and new features
('prevApp_engineer_features', prevApp_engineer_features())
])
bureau_features_pipeline = Pipeline([
('bureau_aggregator', FeaturesAggregator('bureau', bureau_features, agg_funcs)), # Aggregate across old and new features
])
bureau_bal_features_pipeline = Pipeline([
('bureau_bal_aggregator', FeaturesAggregator('bureau_balance', bureau_bal_features , agg_funcs)), # Aggregate across old and new features
])
cc_bal_features_pipeline = Pipeline([
('cc_bal_aggregator', FeaturesAggregator('credit_card_balance', cc_bal_features , agg_funcs)), # Aggregate across old and new features
])
installments_pmnts_features_pipeline = Pipeline([
('installments_pmnts_features_aggregator', FeaturesAggregator('installments_pmnts', installments_pmnts_features , agg_funcs)), # Aggregate across old and new features
])
pos_cash_balance_features_pipeline = Pipeline([
('pos_cash_balance_features_aggregator', FeaturesAggregator('pos_cash_balance', pos_cash_balance_features , agg_funcs)), # Aggregate across old and new features
])
{'AMT_ANNUITY': [('prevApps_AMT_ANNUITY_min', 'min'), ('prevApps_AMT_ANNUITY_max', 'max'), ('prevApps_AMT_ANNUITY_mean', 'mean')], 'AMT_APPLICATION': [('prevApps_AMT_APPLICATION_min', 'min'), ('prevApps_AMT_APPLICATION_max', 'max'), ('prevApps_AMT_APPLICATION_mean', 'mean')], 'AMT_DOWN_PAYMENT': [('prevApps_AMT_DOWN_PAYMENT_min', 'min'), ('prevApps_AMT_DOWN_PAYMENT_max', 'max'), ('prevApps_AMT_DOWN_PAYMENT_mean', 'mean')], 'CNT_PAYMENT': [('prevApps_CNT_PAYMENT_min', 'min'), ('prevApps_CNT_PAYMENT_max', 'max'), ('prevApps_CNT_PAYMENT_mean', 'mean')], 'RATE_INTEREST_PRIVILEGED': [('prevApps_RATE_INTEREST_PRIVILEGED_min', 'min'), ('prevApps_RATE_INTEREST_PRIVILEGED_max', 'max'), ('prevApps_RATE_INTEREST_PRIVILEGED_mean', 'mean')]}
{'AMT_ANNUITY': [('bureau_AMT_ANNUITY_min', 'min'), ('bureau_AMT_ANNUITY_max', 'max'), ('bureau_AMT_ANNUITY_mean', 'mean')], 'AMT_CREDIT_SUM': [('bureau_AMT_CREDIT_SUM_min', 'min'), ('bureau_AMT_CREDIT_SUM_max', 'max'), ('bureau_AMT_CREDIT_SUM_mean', 'mean')], 'DAYS_CREDIT': [('bureau_DAYS_CREDIT_min', 'min'), ('bureau_DAYS_CREDIT_max', 'max'), ('bureau_DAYS_CREDIT_mean', 'mean')], 'AMT_CREDIT_SUM_OVERDUE': [('bureau_AMT_CREDIT_SUM_OVERDUE_min', 'min'), ('bureau_AMT_CREDIT_SUM_OVERDUE_max', 'max'), ('bureau_AMT_CREDIT_SUM_OVERDUE_mean', 'mean')], 'CREDIT_DAY_OVERDUE': [('bureau_CREDIT_DAY_OVERDUE_min', 'min'), ('bureau_CREDIT_DAY_OVERDUE_max', 'max'), ('bureau_CREDIT_DAY_OVERDUE_mean', 'mean')]}
{'MONTHS_BALANCE': [('bureau_balance_MONTHS_BALANCE_min', 'min'), ('bureau_balance_MONTHS_BALANCE_max', 'max'), ('bureau_balance_MONTHS_BALANCE_mean', 'mean')]}
{'MONTHS_BALANCE': [('credit_card_balance_MONTHS_BALANCE_min', 'min'), ('credit_card_balance_MONTHS_BALANCE_max', 'max'), ('credit_card_balance_MONTHS_BALANCE_mean', 'mean')], 'AMT_BALANCE': [('credit_card_balance_AMT_BALANCE_min', 'min'), ('credit_card_balance_AMT_BALANCE_max', 'max'), ('credit_card_balance_AMT_BALANCE_mean', 'mean')], 'CNT_INSTALMENT_MATURE_CUM': [('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_min', 'min'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_max', 'max'), ('credit_card_balance_CNT_INSTALMENT_MATURE_CUM_mean', 'mean')], 'AMT_DRAWINGS_ATM_CURRENT': [('credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_min', 'min'), ('credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_max', 'max'), ('credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean', 'mean')], 'AMT_INST_MIN_REGULARITY': [('credit_card_balance_AMT_INST_MIN_REGULARITY_min', 'min'), ('credit_card_balance_AMT_INST_MIN_REGULARITY_max', 'max'), ('credit_card_balance_AMT_INST_MIN_REGULARITY_mean', 'mean')], 'AMT_PAYMENT_TOTAL_CURRENT': [('credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_min', 'min'), ('credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_max', 'max'), ('credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean', 'mean')]}
{'AMT_INSTALMENT': [('installments_pmnts_AMT_INSTALMENT_min', 'min'), ('installments_pmnts_AMT_INSTALMENT_max', 'max'), ('installments_pmnts_AMT_INSTALMENT_mean', 'mean')], 'AMT_PAYMENT': [('installments_pmnts_AMT_PAYMENT_min', 'min'), ('installments_pmnts_AMT_PAYMENT_max', 'max'), ('installments_pmnts_AMT_PAYMENT_mean', 'mean')]}
{'CNT_INSTALMENT_FUTURE': [('pos_cash_balance_CNT_INSTALMENT_FUTURE_min', 'min'), ('pos_cash_balance_CNT_INSTALMENT_FUTURE_max', 'max'), ('pos_cash_balance_CNT_INSTALMENT_FUTURE_mean', 'mean')], 'MONTHS_BALANCE': [('pos_cash_balance_MONTHS_BALANCE_min', 'min'), ('pos_cash_balance_MONTHS_BALANCE_max', 'max'), ('pos_cash_balance_MONTHS_BALANCE_mean', 'mean')], 'SK_DPD_DEF': [('pos_cash_balance_SK_DPD_DEF_min', 'min'), ('pos_cash_balance_SK_DPD_DEF_max', 'max'), ('pos_cash_balance_SK_DPD_DEF_mean', 'mean')]}
Engineer new features for the Application Train dataset.
# Feature engineering pipeline for application_train
appln_new_features_pipeline = Pipeline([
('engineer_features', engineer_features()), # add some new features
])
# Primary Application Training Dataset
appsTrainDF = datasets['application_train']
# Secondary Datasets
prevAppsDF = datasets["previous_application"]
bureauDF = datasets["bureau"]
bureaubalDF = datasets['bureau_balance']
ccbalDF = datasets["credit_card_balance"]
installmentspaymentsDF = datasets["installments_payments"]
posbalDF = datasets['POS_CASH_balance']
Create aggregate datasets by applying fit & transform.
appsTrainDF_agg = appln_new_features_pipeline.fit_transform(appsTrainDF)
prevApps_agg = prevApps_features_pipeline.fit_transform(prevAppsDF)
# prevApps_agg = prevApp_new_features_pipeline.fit_transform(prevApps_agg)
bureau_agg = bureau_features_pipeline.fit_transform(bureauDF)
# bureaubal_agg = bureau_bal_features_pipeline.fit_transform(bureaubalDF)
ccblance_agg = cc_bal_features_pipeline.fit_transform(ccbalDF)
installments_pmnts_agg = installments_pmnts_features_pipeline.fit_transform(installmentspaymentsDF)
posbal_agg = pos_cash_balance_features_pipeline.fit_transform(posbalDF)
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
Perform data merging of primary application and secondary datasets.
merge_all_data = True
# if merge_all_data:
# prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
# merge the primary table with the secondary tables' aggregate stats
if merge_all_data:
appsTrainDF_agg = appsTrainDF_agg.merge(prevApps_agg, how='left', on='SK_ID_CURR')
appsTrainDF_agg = appsTrainDF_agg.merge(bureau_agg, how='left', on="SK_ID_CURR")
appsTrainDF_agg = appsTrainDF_agg.merge(ccblance_agg, how='left', on="SK_ID_CURR")
appsTrainDF_agg = appsTrainDF_agg.merge(installments_pmnts_agg, how='left', on="SK_ID_CURR")
appsTrainDF_agg = appsTrainDF_agg.merge(posbal_agg, how='left', on="SK_ID_CURR")
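A left merge on `SK_ID_CURR` keeps every application row and fills NaN where a client has no history in the secondary table, which is why the row count stays at 307511 while the column count grows. A toy sketch (frames and the `prev_mean` column are illustrative):

```python
import pandas as pd

apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]})
# Aggregated secondary table: client 2 has no history here.
agg = pd.DataFrame({"SK_ID_CURR": [1, 3], "prev_mean": [10.0, 30.0]})

merged = apps.merge(agg, how="left", on="SK_ID_CURR")
print(merged)  # 3 rows; client 2's prev_mean is NaN
```

An inner merge would instead drop clients without history, silently shrinking the training set.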
appsTrainDF_agg.shape
(307511, 190)
print(appsTrainDF.shape)
(307511, 124)
Check presence of newly engineered features
appsTrainDF_agg[['ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT','ef_prevApps_AMT_ANNUITY_range'
, 'ef_prevApps_AMT_APPLICATION_range', 'ef_prevApps_AMT_DOWN_PAYMENT_range']].head()
| ef_INCOME_CREDIT_PERCENT | ef_ANN_INCOME_PERCENT | ef_prevApps_AMT_ANNUITY_range | ef_prevApps_AMT_APPLICATION_range | ef_prevApps_AMT_DOWN_PAYMENT_range | |
|---|---|---|---|---|---|
| 0 | 0.498036 | 0.121978 | 0.000 | 0.0 | 0.00 |
| 1 | 0.208736 | 0.132217 | 91619.685 | 831190.5 | 6885.00 |
| 2 | 0.500000 | 0.100000 | 0.000 | 0.0 | 0.00 |
| 3 | 0.431748 | 0.219900 | 37471.590 | 688500.0 | 64293.66 |
| 4 | 0.236842 | 0.179963 | 20844.495 | 230323.5 | 571.50 |
Perform the same merging of the primary application_test dataset with the secondary datasets.
X_kaggle_test = datasets["application_test"]
X_kaggle_test = appln_new_features_pipeline.transform(X_kaggle_test) # stateless transformer; no need to re-fit on the test set
merge_all_data = True
if merge_all_data:
X_kaggle_test = X_kaggle_test.merge(prevApps_agg, how='left', on='SK_ID_CURR')
X_kaggle_test = X_kaggle_test.merge(bureau_agg, how='left', on="SK_ID_CURR")
X_kaggle_test = X_kaggle_test.merge(ccblance_agg, how='left', on="SK_ID_CURR")
X_kaggle_test = X_kaggle_test.merge(installments_pmnts_agg, how='left', on="SK_ID_CURR")
X_kaggle_test = X_kaggle_test.merge(posbal_agg, how='left', on="SK_ID_CURR")
print(X_kaggle_test.shape)
X_kaggle_test.head(3)
(48744, 189)
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | 
FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | ef_INCOME_CREDIT_PERCENT | ef_ANN_INCOME_PERCENT | prevApps_AMT_ANNUITY_min | prevApps_AMT_ANNUITY_max | prevApps_AMT_ANNUITY_mean | prevApps_AMT_APPLICATION_min | prevApps_AMT_APPLICATION_max | prevApps_AMT_APPLICATION_mean | prevApps_AMT_DOWN_PAYMENT_min | prevApps_AMT_DOWN_PAYMENT_max | prevApps_AMT_DOWN_PAYMENT_mean | prevApps_CNT_PAYMENT_min | prevApps_CNT_PAYMENT_max | prevApps_CNT_PAYMENT_mean | prevApps_RATE_INTEREST_PRIVILEGED_min | prevApps_RATE_INTEREST_PRIVILEGED_max | prevApps_RATE_INTEREST_PRIVILEGED_mean | ef_prevApps_AMT_ANNUITY_range | ef_prevApps_AMT_APPLICATION_range | ef_prevApps_AMT_DOWN_PAYMENT_range | bureau_AMT_ANNUITY_min | bureau_AMT_ANNUITY_max | bureau_AMT_ANNUITY_mean | bureau_AMT_CREDIT_SUM_min | bureau_AMT_CREDIT_SUM_max | bureau_AMT_CREDIT_SUM_mean | bureau_DAYS_CREDIT_min | bureau_DAYS_CREDIT_max | bureau_DAYS_CREDIT_mean | bureau_AMT_CREDIT_SUM_OVERDUE_min | bureau_AMT_CREDIT_SUM_OVERDUE_max | bureau_AMT_CREDIT_SUM_OVERDUE_mean | bureau_CREDIT_DAY_OVERDUE_min | bureau_CREDIT_DAY_OVERDUE_max | bureau_CREDIT_DAY_OVERDUE_mean | credit_card_balance_MONTHS_BALANCE_min | credit_card_balance_MONTHS_BALANCE_max | credit_card_balance_MONTHS_BALANCE_mean | credit_card_balance_AMT_BALANCE_min | credit_card_balance_AMT_BALANCE_max | credit_card_balance_AMT_BALANCE_mean | credit_card_balance_CNT_INSTALMENT_MATURE_CUM_min | credit_card_balance_CNT_INSTALMENT_MATURE_CUM_max | credit_card_balance_CNT_INSTALMENT_MATURE_CUM_mean | credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_min | 
credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_max | credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean | credit_card_balance_AMT_INST_MIN_REGULARITY_min | credit_card_balance_AMT_INST_MIN_REGULARITY_max | credit_card_balance_AMT_INST_MIN_REGULARITY_mean | credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_min | credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_max | credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean | installments_pmnts_AMT_INSTALMENT_min | installments_pmnts_AMT_INSTALMENT_max | installments_pmnts_AMT_INSTALMENT_mean | installments_pmnts_AMT_PAYMENT_min | installments_pmnts_AMT_PAYMENT_max | installments_pmnts_AMT_PAYMENT_mean | pos_cash_balance_CNT_INSTALMENT_FUTURE_min | pos_cash_balance_CNT_INSTALMENT_FUTURE_max | pos_cash_balance_CNT_INSTALMENT_FUTURE_mean | pos_cash_balance_MONTHS_BALANCE_min | pos_cash_balance_MONTHS_BALANCE_max | pos_cash_balance_MONTHS_BALANCE_mean | pos_cash_balance_SK_DPD_DEF_min | pos_cash_balance_SK_DPD_DEF_max | pos_cash_balance_SK_DPD_DEF_mean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -2329 | -5170.0 | -812 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | NaN | 2.0 | 2 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.066 | 0.059 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | 0.0672 | 0.0612 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0526 | NaN | NaN | 0.0666 | 0.059 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0514 | NaN | NaN | NaN | block of flats | 0.0392 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -1740.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.237342 | 0.152300 | 3951.000 | 3951.000 | 3951.000 | 24835.5 | 24835.5 | 24835.50 | 2520.0 | 2520.0 | 2520.0 | 8.0 | 8.0 | 8.000000 | NaN | NaN | NaN | 0.00 | 0.0 | 0.0 | 0.0 | 10822.5 | 3545.357143 | 85500.00 | 378000.0 | 207623.571429 | -1572.0 | -49.0 | -735.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3951.0 | 17397.900 | 5885.132143 | 3951.000 | 17397.900 | 5885.132143 | 0.0 | 4.0 | 1.444444 | -96.0 | -53.0 | -72.555556 | 0.0 | 7.0 | 0.777778 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -4469 | -9118.0 | -1623 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.444409 | 0.175455 | 4813.200 | 4813.200 | 4813.200 | 0.0 | 44617.5 | 22308.75 | 4464.0 | 4464.0 | 4464.0 | 12.0 | 12.0 | 12.000000 | NaN | NaN | NaN | 0.00 | 44617.5 | 0.0 | 0.0 | 4261.5 | 1420.500000 | 29826.00 | 568800.0 | 219042.000000 | -373.0 | -62.0 | -190.666667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4813.2 | 17656.245 | 6240.205000 | 4813.200 | 17656.245 | 6240.205000 | 0.0 | 12.0 | 7.200000 | -25.0 | -15.0 | -20.000000 | 0.0 | 0.0 | 0.000000 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -4458 | -2175.0 | -3503 | 5.0 | 1 | 1 | 0 | 1 | 0 | 0 | Drivers | 2.0 | 2 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | NaN | 0.699787 | 0.610991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.305308 | 0.344578 | 4742.415 | 23153.985 | 11478.195 | 0.0 | 450000.0 | 130871.25 | 0.0 | 6750.0 | 3375.0 | 6.0 | 36.0 | 17.333333 | NaN | NaN | NaN | 18411.57 | 450000.0 | 6750.0 | 0.0 | 0.0 | 0.000000 | 26490.06 | 1262250.0 | 518070.015000 | -2070.0 | -1210.0 | -1737.500000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -96.0 | -1.0 | -48.5 | 0.0 | 161420.22 | 18159.919219 | 1.0 | 22.0 | 18.719101 | 0.0 | 157500.0 | 6350.0 | 0.0 | 7875.0 | 1454.539551 | 0.0 | 153675.0 | 6817.172344 | 67.5 | 357347.745 | 10897.898516 | 6.165 | 357347.745 | 9740.235774 | 0.0 | 36.0 | 15.305556 | -66.0 | -3.0 | -29.555556 | 0.0 | 0.0 | 0.000000 |
appsTrainDF_agg.columns.tolist()
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT', 'prevApps_AMT_ANNUITY_min', 'prevApps_AMT_ANNUITY_max', 'prevApps_AMT_ANNUITY_mean', 'prevApps_AMT_APPLICATION_min', 'prevApps_AMT_APPLICATION_max', 'prevApps_AMT_APPLICATION_mean', 'prevApps_AMT_DOWN_PAYMENT_min', 'prevApps_AMT_DOWN_PAYMENT_max', 'prevApps_AMT_DOWN_PAYMENT_mean', 'prevApps_CNT_PAYMENT_min', 'prevApps_CNT_PAYMENT_max', 'prevApps_CNT_PAYMENT_mean', 'prevApps_RATE_INTEREST_PRIVILEGED_min', 'prevApps_RATE_INTEREST_PRIVILEGED_max', 'prevApps_RATE_INTEREST_PRIVILEGED_mean', 'ef_prevApps_AMT_ANNUITY_range', 'ef_prevApps_AMT_APPLICATION_range', 'ef_prevApps_AMT_DOWN_PAYMENT_range', 'bureau_AMT_ANNUITY_min', 'bureau_AMT_ANNUITY_max', 'bureau_AMT_ANNUITY_mean', 'bureau_AMT_CREDIT_SUM_min', 'bureau_AMT_CREDIT_SUM_max', 'bureau_AMT_CREDIT_SUM_mean', 'bureau_DAYS_CREDIT_min', 'bureau_DAYS_CREDIT_max', 'bureau_DAYS_CREDIT_mean', 'bureau_AMT_CREDIT_SUM_OVERDUE_min', 'bureau_AMT_CREDIT_SUM_OVERDUE_max', 'bureau_AMT_CREDIT_SUM_OVERDUE_mean', 'bureau_CREDIT_DAY_OVERDUE_min', 'bureau_CREDIT_DAY_OVERDUE_max', 'bureau_CREDIT_DAY_OVERDUE_mean', 'credit_card_balance_MONTHS_BALANCE_min', 'credit_card_balance_MONTHS_BALANCE_max', 'credit_card_balance_MONTHS_BALANCE_mean', 'credit_card_balance_AMT_BALANCE_min', 'credit_card_balance_AMT_BALANCE_max', 'credit_card_balance_AMT_BALANCE_mean', 'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_min', 
'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_max', 'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_mean', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_min', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_max', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean', 'credit_card_balance_AMT_INST_MIN_REGULARITY_min', 'credit_card_balance_AMT_INST_MIN_REGULARITY_max', 'credit_card_balance_AMT_INST_MIN_REGULARITY_mean', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_min', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_max', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean', 'installments_pmnts_AMT_INSTALMENT_min', 'installments_pmnts_AMT_INSTALMENT_max', 'installments_pmnts_AMT_INSTALMENT_mean', 'installments_pmnts_AMT_PAYMENT_min', 'installments_pmnts_AMT_PAYMENT_max', 'installments_pmnts_AMT_PAYMENT_mean', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_min', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_max', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_mean', 'pos_cash_balance_MONTHS_BALANCE_min', 'pos_cash_balance_MONTHS_BALANCE_max', 'pos_cash_balance_MONTHS_BALANCE_mean', 'pos_cash_balance_SK_DPD_DEF_min', 'pos_cash_balance_SK_DPD_DEF_max', 'pos_cash_balance_SK_DPD_DEF_mean']
# approval rate 'NFLAG_INSURED_ON_APPROVAL'
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
                  '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else 162)
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
Train, validation, and test sets (and the leakage problem we mentioned previously):
Let's look at a small use case that shows how to deal with this:
A ValueError is raised because there are new, previously unseen unique values in the test set, and the encoder doesn't know how to handle them. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns. This problem can be solved with the OneHotEncoder option handle_unknown='ignore', which, as the name suggests, ignores previously unseen values when transforming the test set.
Here is an example of that in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in the OHE, which ignores values in the validation/test
# sets that do NOT occur in the training set
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
Please see this blog for more details on OHE when the validation/test sets contain previously unseen unique values.
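A minimal sketch of this behavior, using a toy column rather than the HCDR data: fitting the encoder on training values and then transforming a set that contains an unseen category ("XNA" here) yields an all-zero row instead of raising a ValueError.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: the validation set contains a category ("XNA") never seen in training.
train_col = np.array([["Cash loans"], ["Revolving loans"], ["Cash loans"]])
valid_col = np.array([["Cash loans"], ["XNA"]])

# With handle_unknown="ignore", unseen categories encode to an all-zero row
# instead of raising a ValueError at transform time.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_col)
encoded = ohe.transform(valid_col).toarray()
print(encoded)  # the "XNA" row is all zeros; columns match the training categories
```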
#train_dataset = datasets["application_train"]
train_dataset=appsTrainDF
class_labels = ["No Default","Default"]
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
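A quick sketch of how the selector is used, on a toy frame (the class is restated here so the snippet is self-contained; the column names and values are illustrative):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Restated so the snippet runs on its own.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

# The selector keeps only the requested columns and returns a plain NumPy
# array, which downstream sklearn steps expect.
df = pd.DataFrame({"AMT_CREDIT": [1000.0, 2000.0],
                   "CODE_GENDER": ["F", "M"],
                   "AMT_INCOME_TOTAL": [900.0, 1500.0]})
arr = DataFrameSelector(["AMT_CREDIT", "AMT_INCOME_TOTAL"]).fit_transform(df)
print(arr.shape)  # (2, 2)
```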
# Identify the numeric features we wish to consider.
num_attribs = [
'AMT_INCOME_TOTAL',
'AMT_CREDIT',
'EXT_SOURCE_3',
'EXT_SOURCE_2',
'EXT_SOURCE_1',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'FLOORSMAX_AVG',
'FLOORSMAX_MEDI',
'FLOORSMAX_MODE',
'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE',
'REG_CITY_NOT_LIVE_CITY',
'FLAG_EMP_PHONE',
'REG_CITY_NOT_WORK_CITY',
'DAYS_ID_PUBLISH',
'DAYS_LAST_PHONE_CHANGE',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY',
'ef_INCOME_CREDIT_PERCENT',
'ef_ANN_INCOME_PERCENT',
## Highly correlated previous applications
'prevApps_AMT_ANNUITY_mean',
'prevApps_AMT_DOWN_PAYMENT_min',
'prevApps_AMT_DOWN_PAYMENT_mean',
'prevApps_CNT_PAYMENT_max',
'prevApps_RATE_INTEREST_PRIVILEGED_mean',
'ef_prevApps_AMT_ANNUITY_range',
'ef_prevApps_AMT_APPLICATION_range',
'ef_prevApps_AMT_DOWN_PAYMENT_range',
## Highly correlated Bureau features
'bureau_AMT_ANNUITY_mean',
'bureau_AMT_CREDIT_SUM_mean',
'bureau_DAYS_CREDIT_mean',
'bureau_DAYS_CREDIT_max',
## Highly correlated Installment Payment features
'installments_pmnts_AMT_INSTALMENT_min',
'installments_pmnts_AMT_INSTALMENT_max',
'installments_pmnts_AMT_INSTALMENT_mean',
'installments_pmnts_AMT_PAYMENT_mean',
## Highly correlated Credit card balance features
'credit_card_balance_MONTHS_BALANCE_min',
'credit_card_balance_MONTHS_BALANCE_max',
'credit_card_balance_MONTHS_BALANCE_mean',
'credit_card_balance_AMT_BALANCE_min',
'credit_card_balance_AMT_BALANCE_max',
'credit_card_balance_AMT_BALANCE_mean',
'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean',
'credit_card_balance_AMT_INST_MIN_REGULARITY_mean',
'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean',
## Highly correlated POS balance features
'pos_cash_balance_CNT_INSTALMENT_FUTURE_min',
'pos_cash_balance_CNT_INSTALMENT_FUTURE_max',
'pos_cash_balance_CNT_INSTALMENT_FUTURE_mean',
'pos_cash_balance_MONTHS_BALANCE_mean'
]
print('Number of numerical features: ', len(num_attribs))
Number of numerical features: 50
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
               'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']
print('Number of categorical features: ', len(cat_attribs))
Number of categorical features: 7
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
Use ColumnTransformer instead of FeatureUnion
from sklearn.compose import ColumnTransformer
# data_prep_pipeline = FeatureUnion(transformer_list=[
# ("num_pipeline", num_pipeline),
# ("cat_pipeline", cat_pipeline),
# ])
data_prep_pipeline = ColumnTransformer(transformers=[
    # (name, transformer, columns)
    ("num_pipeline", num_pipeline, num_attribs),
    ("cat_pipeline", cat_pipeline, cat_attribs),
], n_jobs=-1)
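A minimal sketch of the ColumnTransformer pattern on a hypothetical two-column frame: the numeric column is imputed and scaled, the categorical column is one-hot encoded, and the outputs are concatenated column-wise.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; column names are illustrative, not the full HCDR feature set.
df = pd.DataFrame({"AMT_CREDIT": [1000.0, None, 3000.0],
                   "CODE_GENDER": ["F", "M", "F"]})
prep = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["AMT_CREDIT"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["CODE_GENDER"]),
])
X = prep.fit_transform(df)
print(X.shape)  # 1 scaled numeric column + 2 one-hot columns = (3, 3)
```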
print('Numerical Feature Family: ', num_attribs)
print('Numerical Feature Count: ', len(num_attribs))
print('--------------------------')
print('Categorical Feature Family: ', cat_attribs)
print('Categorical Feature Count: ', len(cat_attribs))
print('--------------------------')
print('Total Number of Input Features: ', len(num_attribs) + len(cat_attribs))
Numerical Feature Family: ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'FLOORSMAX_AVG', 'FLOORSMAX_MEDI', 'FLOORSMAX_MODE', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'REG_CITY_NOT_LIVE_CITY', 'FLAG_EMP_PHONE', 'REG_CITY_NOT_WORK_CITY', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT', 'prevApps_AMT_ANNUITY_mean', 'prevApps_AMT_DOWN_PAYMENT_min', 'prevApps_AMT_DOWN_PAYMENT_mean', 'prevApps_CNT_PAYMENT_max', 'prevApps_RATE_INTEREST_PRIVILEGED_mean', 'ef_prevApps_AMT_ANNUITY_range', 'ef_prevApps_AMT_APPLICATION_range', 'ef_prevApps_AMT_DOWN_PAYMENT_range', 'bureau_AMT_ANNUITY_mean', 'bureau_AMT_CREDIT_SUM_mean', 'bureau_DAYS_CREDIT_mean', 'bureau_DAYS_CREDIT_max', 'installments_pmnts_AMT_INSTALMENT_min', 'installments_pmnts_AMT_INSTALMENT_max', 'installments_pmnts_AMT_INSTALMENT_mean', 'installments_pmnts_AMT_PAYMENT_mean', 'credit_card_balance_MONTHS_BALANCE_min', 'credit_card_balance_MONTHS_BALANCE_max', 'credit_card_balance_MONTHS_BALANCE_mean', 'credit_card_balance_AMT_BALANCE_min', 'credit_card_balance_AMT_BALANCE_max', 'credit_card_balance_AMT_BALANCE_mean', 'credit_card_balance_AMT_DRAWINGS_ATM_CURRENT_mean', 'credit_card_balance_AMT_INST_MIN_REGULARITY_mean', 'credit_card_balance_AMT_PAYMENT_TOTAL_CURRENT_mean', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_min', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_max', 'pos_cash_balance_CNT_INSTALMENT_FUTURE_mean', 'pos_cash_balance_MONTHS_BALANCE_mean'] Numerical Feature Count: 50 -------------------------- Categorical Feature Family: ['CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE'] Categorical Feature Count: 7 -------------------------- Total Number of Input Features: 57
selected_features = num_attribs + cat_attribs
len(selected_features)
57
# from sklearn.base import BaseEstimator, TransformerMixin
# import re
# # Creates the following date features
# # But could do so much more with these features
# # E.g.,
# # extract the domain address of the homepage and OneHotEncode it
# #
# # ['release_month','release_day','release_year', 'release_dayofweek','release_quarter']
# class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
# def __init__(self, features="OCCUPATION_TYPE"): # no *args or **kargs
# self.features = features
# def fit(self, X, y=None):
# return self # nothing else to do
# def transform(self, X):
# df = pd.DataFrame(X, columns=self.features)
# #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
# # df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)
# #df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(map(lambda x: 1. if x.lower() in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.))
# df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in list(map(lambda x: x.lower(), ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'])) else 0.)
# #df.drop(self.features, axis=1, inplace=True)
# return np.array(df.values) #return a Numpy Array to observe the pipeline protocol
# from sklearn.pipeline import make_pipeline
# features = ["OCCUPATION_TYPE"]
# def test_driver_prep_OCCUPATION_TYPE():
# print(f"X_train.shape: {X_train.shape}\n")
# print(f"X_train['name'][0:5]: \n{X_train[features][0:5]}")
# test_pipeline = make_pipeline(prep_OCCUPATION_TYPE(features))
# return(test_pipeline.fit_transform(X_train))
# x = test_driver_prep_OCCUPATION_TYPE()
# print(f"Test driver: \n{test_driver_prep_OCCUPATION_TYPE()[0:10, :]}")
# print(f"X_train['name'][0:10]: \n{X_train[features][0:10]}")
# # QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
list(datasets["application_train"].columns)
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'ef_INCOME_CREDIT_PERCENT', 'ef_ANN_INCOME_PERCENT']
To get a baseline, we will use a subset of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression classifier.
def pct(x):
    return round(100 * x, 3)

try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test Acc",
                                   "Train AUC",
                                   "Valid AUC",
                                   "Test AUC",
                                   "P Score",
                                   "Train RMSE",
                                   "Valid RMSE",
                                   "Test RMSE",
                                   "Train MAE",
                                   "Valid MAE",
                                   "Test MAE",
                                   "Train Time",
                                   "Test Time",
                                   "Description"])
train_dataset = appsTrainDF_agg
train_dataset.shape
(307511, 190)
# Subsample to feed the pipeline; this results in a new dataset that is (1 / splits) the original size
splits = 3
# Train/test split percentage
subsample_rate = 0.3
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test = X_kaggle_test[selected_features]
## Split off a held-out test set, then a validation set
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train,
                                                      test_size=0.15, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X_kaggle_test shape: {X_kaggle_test.shape}")
X train shape: (60989, 57)
X validation shape: (10763, 57)
X test shape: (30752, 57)
X_kaggle_test shape: (48744, 57)
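Why stratify? With a default rate of roughly 8%, an unstratified split can drift the class ratio between splits; stratify=y keeps it (nearly) identical. A toy sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 92 + [1] * 8)      # imbalanced labels, 8% positives
X = np.arange(100).reshape(-1, 1)     # dummy feature

# stratify=y preserves the positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.25, random_state=42)
print(y_tr.mean(), y_te.mean())  # both exactly 0.08 here
```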
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
    ("preparation", data_prep_pipeline),
    ("logistic", LogisticRegression())
])
CPU times: user 5.12 ms, sys: 0 ns, total: 5.12 ms Wall time: 5.56 ms
Split the training data into 10 folds to perform cross-validation.
# Import Model selection libraries
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
# Use ShuffleSplit() with 10 splits, 30% test_size, and random_state=0
cvSplits = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
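Note that ShuffleSplit draws a fresh random partition on each repeat, so unlike KFold its test sets may overlap across repeats. A small sketch on dummy data:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Each iteration is an independent random 70/30 partition of the 20 samples.
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
X = np.arange(20).reshape(-1, 1)
splits = list(cv.split(X))
print(len(splits))                    # 10 repeats
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))  # 14 train / 6 test
```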
from time import time, ctime
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Set up cross validation scores
logit_scores = cross_val_score(full_pipeline_with_predictor, X_train, y_train, cv=cvSplits)
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)
# Time and score test predictions
start = time()
logit_score_test = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. Computing the area under the ROC curve summarizes the curve's information in a single number.
from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, roc_auc_score
from scipy import stats
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
    [accuracy_score(y_train, model.predict(X_train)),
     accuracy_score(y_valid, model.predict(X_valid)),
     accuracy_score(y_test, model.predict(X_test)),
     roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
     roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
     roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
     0,  # p-value not relevant for the baseline model
     np.round(np.sqrt(mean_squared_error(y_train, model.predict_proba(X_train)[:, 1])), 3),
     np.round(np.sqrt(mean_squared_error(y_valid, model.predict_proba(X_valid)[:, 1])), 3),
     np.round(np.sqrt(mean_squared_error(y_test, model.predict_proba(X_test)[:, 1])), 3),
     np.round(mean_absolute_error(y_train, model.predict_proba(X_train)[:, 1]), 3),
     np.round(mean_absolute_error(y_valid, model.predict_proba(X_valid)[:, 1]), 3),
     np.round(mean_absolute_error(y_test, model.predict_proba(X_test)[:, 1]), 3)
     ], 4)) \
    + [train_time, test_time] + [f"Baseline LR {len(selected_features)}"]
expLog
| | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC | P Score | Train RMSE | Valid RMSE | Test RMSE | Train MAE | Valid MAE | Test MAE | Train Time | Test Time | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_57_features | 0.9189 | 0.9194 | 0.9184 | 0.7538 | 0.744 | 0.7478 | 0.0 | 0.262 | 0.262 | 0.263 | 0.137 | 0.138 | 0.137 | 21.7724 | 0.6446 | Baseline LR 57 |
from sklearn.metrics import confusion_matrix
def confusion_matrix_def(model, X_train, y_train, X_test, y_test):
    # Predictions
    preds_test = model.predict(X_test)
    preds_train = model.predict(X_train)
    # Row-normalize each confusion matrix so cells show per-class rates
    cm_train = confusion_matrix(y_train, preds_train).astype(np.float32)
    cm_train /= cm_train.sum(axis=1)[:, np.newaxis]
    cm_test = confusion_matrix(y_test, preds_test).astype(np.float32)
    cm_test /= cm_test.sum(axis=1)[:, np.newaxis]
    class_labels = ['No Default', 'Default']
    plt.figure(figsize=(20, 8))
    plt.subplot(121)
    g = sns.heatmap(cm_train, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Train", fontsize=14)
    plt.subplot(122)
    g = sns.heatmap(cm_test, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=14)
    plt.ylabel("True", fontsize=14)
    g.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title("Test", fontsize=14)
# Confusion matrix for baseline model
confusion_matrix_def(model,X_train,y_train,X_test,y_test)
plt.show()
The baseline logistic regression model was tuned over the following parameter grid and evaluated on the metrics described above:
import json
classifiers = [
    ('Logistic Regression', LogisticRegression(random_state=42))
]
params_grid = {
    'Logistic Regression': {
        'penalty': ('l1', 'l2'),
        'tol': (0.0001, 0.00001, 0.0000001),
        'C': (10, 1, 0.1, 0.01),
    }
}
results = [logit_scores]
names = ['Baseline LR']
for (name, classifier) in classifiers:
    # Print classifier and parameters
    print('****** START', name, '*****')
    parameters = params_grid[name]
    print("Parameters:")
    for p in sorted(parameters.keys()):
        print("\t" + str(p) + ": " + str(parameters[p]))
    # Generate the pipeline
    full_pipeline_with_predictor = Pipeline([
        ("preparation", data_prep_pipeline),
        ("predictor", classifier)
    ])
    # Execute the grid search
    params = {}
    for p in parameters.keys():
        pipe_key = 'predictor__' + str(p)
        params[pipe_key] = parameters[p]
    grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=5,
                               n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    # Best estimator score
    best_train = pct(grid_search.best_score_)
    # Best estimator fitting time
    start = time()
    model = grid_search.best_estimator_.fit(X_train, y_train)
    train_time = round(time() - start, 4)
    # Best estimator prediction time
    start = time()
    best_test_accuracy = pct(grid_search.best_estimator_.score(X_test, y_test))
    test_time = round(time() - start, 4)
    # Best train scores
    best_train_scores = cross_val_score(grid_search.best_estimator_, X_train, y_train, cv=cvSplits)
    best_train_accuracy = pct(best_train_scores.mean())
    results.append(best_train_scores)
    names.append(name)
    # Conduct t-test with baseline logit (control) and best estimator (experiment)
    (t_stat, p_value) = stats.ttest_rel(logit_scores, best_train_scores)
    # Create confusion matrix for the best model
    confusion_matrix_def(model, X_train, y_train, X_test, y_test)
    plt.show()
    # Collect the best parameters found by the grid search
    print("Best Parameters:")
    best_parameters = grid_search.best_estimator_.get_params()
    param_dump = []
    for param_name in sorted(params.keys()):
        param_dump.append((param_name, best_parameters[param_name]))
        print("\t" + str(param_name) + ": " + str(best_parameters[param_name]))
    print("****** FINISH", name, '*****')
    print("")
    # Record the results
    exp_name = name
    expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
        [accuracy_score(y_train, model.predict(X_train)),
         accuracy_score(y_valid, model.predict(X_valid)),
         accuracy_score(y_test, model.predict(X_test)),
         roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
         roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
         roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
         p_value,
         np.round(np.sqrt(mean_squared_error(y_train, model.predict_proba(X_train)[:, 1])), 3),
         np.round(np.sqrt(mean_squared_error(y_valid, model.predict_proba(X_valid)[:, 1])), 3),
         np.round(np.sqrt(mean_squared_error(y_test, model.predict_proba(X_test)[:, 1])), 3),
         np.round(mean_absolute_error(y_train, model.predict_proba(X_train)[:, 1]), 3),
         np.round(mean_absolute_error(y_valid, model.predict_proba(X_valid)[:, 1]), 3),
         np.round(mean_absolute_error(y_test, model.predict_proba(X_test)[:, 1]), 3)
         ], 4)) \
        + [train_time, test_time] \
        + [json.dumps(param_dump)]
****** START Logistic Regression *****
Parameters:
C: (10, 1, 0.1, 0.01)
penalty: ('l1', 'l2')
tol: (0.0001, 1e-05, 1e-07)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best Parameters:
    predictor__C: 0.01
    predictor__penalty: l2
    predictor__tol: 0.0001
****** FINISH Logistic Regression *****
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()
plt.show()
print('Final experiment results:')
expLog
Final experiment results:
| | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC | P Score | Train RMSE | Valid RMSE | Test RMSE | Train MAE | Valid MAE | Test MAE | Train Time | Test Time | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_57_features | 0.9189 | 0.9194 | 0.9184 | 0.7538 | 0.7440 | 0.7478 | 0.000 | 0.262 | 0.262 | 0.263 | 0.137 | 0.138 | 0.137 | 21.7724 | 0.6446 | Baseline LR 57 |
| 1 | Logistic Regression | 0.9188 | 0.9195 | 0.9186 | 0.7509 | 0.7433 | 0.7464 | 0.051 | 0.262 | 0.262 | 0.263 | 0.138 | 0.138 | 0.138 | 1.6323 | 0.5853 | [["predictor__C", 0.01], ["predictor__penalty"... |
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
test_class_scores = model.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
array([0.06296224, 0.22933675, 0.04993455, 0.02661781, 0.11735878,
0.05061509, 0.02893913, 0.06891914, 0.01692273, 0.09845579])
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']]
submit_df['TARGET'] = test_class_scores
submit_df.head()
| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.062962 |
| 1 | 100005 | 0.229337 |
| 2 | 100013 | 0.049935 |
| 3 | 100028 | 0.026618 |
| 4 | 100038 | 0.117359 |
submit_df.to_csv("submission.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission"
100%|███████████████████████████████████████| 1.26M/1.26M [00:02<00:00, 587kB/s] Successfully submitted to Home Credit Default Risk

The primary purpose of the HCDR project is to create a machine learning model that accurately predicts whether a customer will repay a loan.
In the first phase of this project, we conducted basic exploratory data analysis on all the datasets, created a baseline pipeline, and selected key metrics. We then conducted a statistical analysis of the numerical and categorical features. By engineering features from the highly correlated attributes, we were able to establish a stronger baseline.
The results obtained in this phase indicate no statistically significant difference between our baseline and best-performing models: the p-value of 0.051 is just above the conventional 0.05 threshold. Both models achieve roughly 91.9% accuracy and about 75% AUC across the training, validation, and test datasets.
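The paired t-test used to compare the baseline and tuned models can be sketched with hypothetical CV scores (made-up numbers; the real comparison uses logit_scores and best_train_scores from the grid-search loop):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold CV accuracies for two models on the same 10 folds.
baseline = np.array([0.918, 0.920, 0.919, 0.921, 0.918,
                     0.919, 0.920, 0.918, 0.921, 0.919])
tuned = baseline + np.array([0.002, -0.001, 0.001, 0.0, 0.001,
                             -0.002, 0.002, 0.0, 0.001, -0.001])

# A paired test accounts for fold-to-fold correlation; a p-value >= 0.05
# means we cannot reject the null that the two models perform the same.
t_stat, p_value = stats.ttest_rel(baseline, tuned)
print(p_value < 0.05)  # False: the tiny difference is not significant
```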
Our ROC_AUC score for the Kaggle submission was 0.74306.
Home Credit is an international non-bank financial institution that aims to lend to people regardless of their credit history. The Home Credit Group focuses on providing a positive borrowing experience for customers underserved by traditional banks. To that end, Home Credit Group published a dataset on Kaggle with the goal of identifying and reducing unfair loan rejections.
The purpose of this project is to create a machine learning model that accurately predicts whether a customer will repay a loan. Our task is to build a pipeline and a baseline machine learning model using a logistic regression classifier. The final model will be evaluated with a number of different performance metrics that we can use to build better models. Businesses can use such a model to identify whether a loan is at risk of default. The resulting model should help ensure that clients capable of repaying their loans are not rejected, and that loans are granted with a principal, maturity, and repayment calendar that set clients up for success.
The results of the machine learning pipelines are measured using the following metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), accuracy, precision, recall, the confusion matrix, and Area Under the ROC Curve (AUC).
The results of our pipelines will be analyzed and ranked, and the best-performing pipeline will be submitted to the Home Credit Default Risk (HCDR) Kaggle competition.
Workflow
We are implementing the workflow outlined below. In Phase 0, we understood the project modelling requirements and outlined our plans. In Phase 1, we perform the first of three iterations of the remainder of the workflow.

The dataset contains one primary application table (split into train and test) and six secondary tables.

Primary Tables
application_train: This primary table includes the application information for each loan application at Home Credit, one row per application. Each row includes the target variable indicating whether or not the loan was repaid; we use this field as the basis for determining feature importance. The target variable is binary, since this is a classification problem, and takes on two values: 0 (the loan was repaid on time) and 1 (the client had payment difficulties).

application_test: This table includes the application information for each loan application at Home Credit, one row per application. The features are the same as in the training data but exclude the target variable. There are 121 variables and 48,744 rows.
Secondary Tables
Bureau \ This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We can join it to the primary table on the current application ID, SK_ID_CURR. \ \ There are 17 variables and 1,716,428 data entries.
Bureau Balance \ This table includes the monthly balance for a previous credit at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We can join it to the Bureau table on the bureau ID, SK_ID_BUREAU. \ \ There are 3 variables and 27,299,925 data entries.
Previous Application \ This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We can join it to the primary table on the current application ID, SK_ID_CURR. There are four types of contracts: (a) consumer loan (POS – credit limit given to buy consumer goods), (b) cash loan (client is given cash), (c) revolving loan (credit), and (d) XNA (contract type without values). \ \ There are 37 variables and 1,670,214 data entries.
POS CASH Balance \ This table includes a monthly balance snapshot of a previous point-of-sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We can join it to the Previous Application table on the previous application ID, SK_ID_PREV, and then to the primary table on the current application ID, SK_ID_CURR. \ \ There are 8 variables and 10,001,358 data entries.
Credit Card Balance \ This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table. We can join it to the Previous Application table on the previous application ID, SK_ID_PREV, and then to the primary table on the current application ID, SK_ID_CURR. \ \ There are 23 variables and 3,840,312 data entries.
Installments Payments \ This table includes previous repayments made or missed by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We can join it to the Previous Application table on the previous application ID, SK_ID_PREV, and then to the primary table on the current application ID, SK_ID_CURR. \ \ There are 8 variables and 13,605,401 data entries.
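The join chain described above can be sketched with pandas merges. The toy frames below are illustrative (column names follow the HCDR schema, but the values are made up): a snapshot table joins to Previous Application on SK_ID_PREV, then to the primary table on SK_ID_CURR.

```python
import pandas as pd

# Hypothetical toy frames mirroring the HCDR schema.
application_train = pd.DataFrame({"SK_ID_CURR": [100001, 100002], "TARGET": [0, 1]})
previous_application = pd.DataFrame(
    {"SK_ID_PREV": [200001, 200002], "SK_ID_CURR": [100001, 100002]}
)
installments_payments = pd.DataFrame(
    {"SK_ID_PREV": [200001, 200001, 200002], "AMT_PAYMENT": [500.0, 500.0, 750.0]}
)

# Step 1: attach SK_ID_CURR to each installment row via the previous application.
inst = installments_payments.merge(previous_application, on="SK_ID_PREV", how="left")
# Step 2: join back to the primary table on SK_ID_CURR.
joined = inst.merge(application_train, on="SK_ID_CURR", how="left")
```

The same two-step pattern applies to POS CASH Balance and Credit Card Balance; Bureau Balance instead joins to Bureau on SK_ID_BUREAU before reaching the primary table.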
The following data preprocessing tasks need to be achieved to prepare the datasets after downloading and unzipping the main application and secondary datasets:
For the Exploratory Data Analysis component of this phase, we performed a preliminary analysis of the data to validate its quality and distribution before modeling.
We looked at summary statistics for each table in the model. We primarily focused on the data distribution, identifying statistics such as the count, mean, standard deviation, minimum, IQR, and maximum.
We also looked at specific numerical and categorical features and visualized them. We created a heatmap to identify the correlation between each feature and the target variable. We also visualized the age, occupation, and distribution of credit amounts.
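A minimal sketch of these summary-statistics and correlation steps, using a small hypothetical frame in place of the real application table (the real heatmap was rendered from the full feature correlation matrix):

```python
import pandas as pd

# Toy stand-in for application_train; column names follow the HCDR schema.
df = pd.DataFrame({
    "TARGET": [0, 0, 1, 0, 1],
    "AMT_CREDIT": [200000.0, 450000.0, 150000.0, 300000.0, 120000.0],
    "DAYS_BIRTH": [-12000, -15000, -9000, -20000, -8500],
})

# Count, mean, std, min, quartiles (IQR bounds), and max for each numeric column.
summary = df.describe()

# Correlation of each feature with the target, e.g. as input to a heatmap
# (seaborn.heatmap(df.corr()) would render it).
corr_with_target = df.corr()["TARGET"].sort_values(ascending=False)
```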
Please see the Exploratory Data Analysis section for our complete EDA.
In our feature engineering process, we created two types of features to enhance our dataset. First, we created aggregate features capturing the minimum, maximum, and mean of numerical attributes across the primary and secondary datasets that were highly correlated with the target variable. Second, we derived range features that take the difference between the max and min aggregate values.

We decided to engineer the following new features from the Application and Previous Application datasets:
Application_Train:
Previous_Application: (all the below values are range calculations that take the difference between the max and min aggregate values)
In order to identify the highly correlated features, we created a simple function that took a secondary dataframe name as an input variable and generated a correlation matrix between all the features in the inputted dataframe and the primary dataset's target variable.
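The helper itself is not reproduced here; a hedged reconstruction of such a function might look like the following (the name `target_correlations` and its signature are our own, not the project's actual code):

```python
import pandas as pd

def target_correlations(secondary_df: pd.DataFrame, train_df: pd.DataFrame,
                        key: str = "SK_ID_CURR") -> pd.Series:
    """Correlate every numeric feature of a secondary table with the
    primary table's TARGET, sorted by absolute correlation."""
    merged = secondary_df.merge(train_df[[key, "TARGET"]], on=key, how="inner")
    corr = merged.corr(numeric_only=True)["TARGET"]
    corr = corr.drop(labels=[key], errors="ignore").drop("TARGET")
    return corr.sort_values(key=abs, ascending=False)

# Toy usage: AMT_CREDIT_SUM here tracks TARGET perfectly, so its correlation is 1.
train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4], "TARGET": [0, 1, 0, 1]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4],
                       "AMT_CREDIT_SUM": [100.0, 200.0, 100.0, 200.0]})
corrs = target_correlations(bureau, train)
```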
All the aggregate values were calculated from the original dataframes, and a new set of dataframes (comprising the primary and secondary datasets) was generated. After the secondary datasets were merged with the primary "application_train" dataset, the new consolidated application training dataframe had a total of 187 features (including the aggregate calculations for specific features).
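The aggregate-and-merge step can be sketched as follows (toy data; the `BUREAU_` column-name prefix and values are illustrative):

```python
import pandas as pd

train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})

# min / max / mean per current application, with flattened column names.
agg = bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].agg(["min", "max", "mean"])
agg.columns = [f"BUREAU_AMT_CREDIT_SUM_{c.upper()}" for c in agg.columns]

# Range feature, as used for the Previous_Application features: max minus min.
agg["BUREAU_AMT_CREDIT_SUM_RANGE"] = (
    agg["BUREAU_AMT_CREDIT_SUM_MAX"] - agg["BUREAU_AMT_CREDIT_SUM_MIN"]
)
agg = agg.reset_index()

# Merge the aggregates into the primary training table.
train_aug = train.merge(agg, on="SK_ID_CURR", how="left")
```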
Further, the top highly correlated features (positive and negative) were chosen from both the primary and secondary datasets. These features were then classified into numerical and categorical variables to form inputs for 2 individual pipelines. In total, our baseline model comprised 53 features (46 numerical and 7 categorical).
(Please see Feature Engineering section and Feature Aggregator for more details)
Implementing Logistic Regression as a baseline model is a good starting point for classification tasks due to its easy implementation and low computational requirements. For the first experiment, we combined our data preparation pipeline and Logistic Regression with default parameters (penalty = 'l2', C = 1.0, solver = 'lbfgs', tol = 1e-4). We then fine-tuned the regularization (l1 vs. l2), tolerance, and C hyperparameters using Grid Search and compared the resulting best estimator with the baseline model. We used 5-fold cross-validation along with the hyperparameter grid to tune the model via the GridSearchCV function in scikit-learn.
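A condensed sketch of this experiment on synthetic data (the real pipeline also includes the imputation and encoding steps described earlier; `make_classification` stands in for the HCDR features, and the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared HCDR feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: default Logistic Regression behind a scaling step.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", tol=1e-4)),
])

# Tune regularization type, C, and tolerance with 5-fold CV.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__tol": [1e-4, 1e-3],
    "clf__penalty": ["l1", "l2"],
    "clf__solver": ["liblinear"],  # liblinear supports both l1 and l2
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
test_auc = search.score(X_test, y_test)
```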
Here is the high-level workflow for the model pipeline followed by detailed steps:

Two experiments were conducted in total to develop a baseline model.
Below is the resulting table for the two baseline models we developed on the given dataset.

Since HCDR is a classification task, we used the following metrics to measure model performance.
MAE
The mean absolute error is the average of the absolute values of individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.
$$ \text{MAE}(\mathbf{X}, h_{\mathbf{\theta}}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{| \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)}|} $$
RMSE
The root mean square error is the normalized distance between the vector of predicted values and the vector of observed values. First, the squared difference between each observed value and predicted value is calculated. RMSE is the square root of the mean of these squared differences.
$$ \text{RMSE}(\mathbf{X}, h_{\mathbf{\theta}}) = \sqrt{\dfrac{1}{m} \sum\limits_{i=1}^{m}{( \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)})^2}} $$
Accuracy Score
This metric describes the fraction of correctly classified samples. In scikit-learn, it can instead return the raw number of correct samples (normalize=False). Accuracy is the default scoring method for both logistic regression and k-Nearest Neighbors in scikit-learn.

Precision
The precision is the ratio of true positives over the total number of predicted positives.

Recall
The recall is the ratio of true positives over the sum of true positives and false negatives. Recall assesses the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.

Confusion Matrix
The confusion matrix, in this case for a binary classification, is a 2x2 matrix that contains the count of the true positives, false positives, true negatives, and false negatives.
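These four metrics can be computed directly with scikit-learn; the labels below are illustrative:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions for a binary classifier.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)

# 2x2 confusion matrix for labels [0, 1]: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```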

AUC (Area under ROC curve)
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the True Positive Rate and the False Positive Rate.

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1).

AUC is desirable for the following two reasons:
p-value
p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
We will compare the classifiers with the baseline untuned model by conducting a two-tailed hypothesis test.
Null Hypothesis, H0: There is no significant difference between the two machine learning pipelines.
Alternate Hypothesis, HA: The two machine learning pipelines are different.
A p-value less than or equal to the significance level is considered statistically significant.
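One common way to run such a two-tailed test is a paired t-test over per-fold cross-validation scores of the two pipelines. The fold scores below are illustrative placeholders, not the project's actual numbers:

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold AUC scores from 5-fold CV of each pipeline.
baseline_scores = [0.740, 0.748, 0.745, 0.751, 0.743]
tuned_scores = [0.746, 0.752, 0.749, 0.755, 0.747]

# Paired two-tailed t-test: same folds, so the samples are dependent.
t_stat, p_value = ttest_rel(baseline_scores, tuned_scores)
reject_h0 = p_value <= 0.05  # significance level alpha = 0.05
```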
In our first experiment for our baseline model, we received fairly high accuracy scores for our training and validation datasets at around 91.9%. Our test accuracy was similar at 91.8%. While this suggests a robust model, the AUC values for the training, test, and validation datasets are only between 74% and 75%, meaning the model still produces a meaningful share of false positives and false negatives. Our second experiment, in which we applied Grid Search to identify the best hyperparameters for Logistic Regression, exhibited similar accuracy and AUC metrics. In both experiments, the RMSE and MAE values were fairly low. In addition, the tuned algorithm from Grid Search did not achieve statistical significance, though it came very close with a p-value of 0.051. Therefore, we fail to reject the null hypothesis, as there is insufficient evidence of a significant difference between the two pipelines.
We believe that the high accuracy scores paired with lower AUC scores are partly the result of an imbalanced dataset with respect to the target variable values. From our EDA, we found that nearly 92% of our application training data had a target value of 0 (no default), with the remaining 8% having a value of 1. With so few defaulted loans in the data, a model that mostly predicts the majority class will still achieve a high accuracy score regardless of how well it identifies defaults.
For our Kaggle submission, we used the baseline model with best parameters (second experiment) since the test accuracy was slightly better.
We opted out of using Mean Absolute Percentage Error (MAPE) as an evaluation metric (as originally proposed in Phase 0) because we were getting an undefined output ('Inf') due to division by zero in the denominator.
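A toy illustration of why MAPE breaks down here: any zero among the true values puts a zero in the denominator, so the metric becomes undefined (infinite). The values below are made up:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 1.0])  # the HCDR target contains zeros
y_pred = np.array([0.2, 0.9, 1.1])

# Per-instance term |y_true - y_pred| / y_true divides by zero on the first row.
with np.errstate(divide="ignore", invalid="ignore"):
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```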
In the Home Credit Default Risk (HCDR) project, we are using Home Credit’s data to better predict repayment of a loan by a customer with little to no credit history.
In Phase 1 of our project, we designed and developed a process to ingest Home Credit's application and secondary client data, analyze the dataset features, transform and engineer the best parameters, and evaluate machine learning algorithms. Our workflow led us to a tuned algorithm that was not statistically significant compared to untuned logistic regression but came close with a p-value of 0.051.
We developed and evaluated a baseline model using tuned logistic regression. Though our model exhibited high accuracy scores, there is still some room for improvement with respect to the AUC scores. In Phase 2, we plan to refine our model training process by adding additional features from the bureau datasets. While evaluating other algorithms listed in our project proposal, we also plan to capture log loss as part of our model evaluation metrics in order to get a complete picture of model performance.
One of the major challenges faced was identifying a method to select the most relevant features. It was initially difficult to strike a balance between too few and too many features for our baseline model. Also, as mentioned in the Conclusion, our team maintains a healthy skepticism of the results, as we believe we need to resample our dataset in the next phase. We also spent considerable time getting our code to work smoothly and ensuring that the basic data transformations, feature engineering, and setup were correct, in addition to analyzing the results of our baseline model.
Along the way, we faced several technical issues in developing this notebook:
Below is the screenshot of our best Kaggle submission.

Read the following: